Principles for Building Biomedical Ontologies
Suzanna Lewis
National Center for Biomedical Ontology
22 October 2005
Advanced Bioinformatics, Cold Spring Harbor
National Center for Biomedical Ontology
http://bioontology.org/
Mark Musen, PI & Core 1: computer science (SMI)
Suzanna Lewis, Co-PI & Core 2: bioinformatics (BiKR; GO)
Barry Smith, Core 6: Outreach and training (ECOR)
Sima Misra, Associate Program Director
Daniel Rubin, Program Director
Michael Ashburner, Core 3: Phenotype Project (Cambridge; FlyBase; and GO)
Monte Westerfield, Core 3: Phenotype Project (UOregon; PI of ZFIN)
Ida Sim, Core 3: HIV clinical trials Project (UCSF)
BiKRs
Sima Misra Shu Shengqiang Christopher J. Mungall
Nomi Harris John Day-Richter Karen Eilbeck Mark Gibson
Outline for the Morning
A definition of “ontology”
Four sessions:
1. Organizational Challenges
2. Principles for Ontology Construction
3. Case Studies from the GO
4. Case Studies for group discussion
My newbie questions
What data is missing?
What I’ve heard
Where is the data generated?
What is the motivation?
How will it be gathered?
Organism, environment, data quality and attribution
TIGR, Sanger, JGI, and coming soon to a 954 near you!
Still an issue. Low threshold of effort relative to benefits of complying
Data is accumulating on disks across the world and we’d like to be able to locate and use it
The hardest part: Sharing (semantics)
handy ontology tells us what’s there…
Where should I eat…?
Ontologies help with decision making
Type of cuisine (Presumable) country of origin
Ontologies don’t just organize data; they also facilitate inference, and that creates new knowledge, often unconsciously in the user.
Where delicatessen food hails from…
‘Frozen Yogurt’ cuisine in search of a national identity?
What a computer would likely infer about the world from this helpful ontology:
Flag of fresh juice
Fresh Juice is a national cuisine…
Ontology is all about meaning
Communities form (scientific) theories that seek to explain all of the existing evidence and can be used for prediction
We make inferences and decisions based upon what we know about (biological) reality.
Make our meanings clear enough for a computer to understand
An ontology is a computable representation of this underlying (biological) reality.
An ontology enables a computer to reason over the data in (some of) the ways that we do, particularly to query and locate relevant data.
A shared, common, backbone taxonomy of relevant entities, and the relationships between them, within an application domain. Referred to by information scientists as an ‘Ontology’.
But really…
What is an Ontology? From Aristotle to Artificial Intelligence
It is “a formalism of what exists”
Follows formal rules for creating definitions originally laid down by Aristotle.
A definition is: the specification of the essence (nature, invariant structure) shared by all the members of a class or natural kind.
The Aristotelian Methodology
Topmost nodes are the undefinable primitives.
The definition of a class lower down in the hierarchy is provided by specifying the parent of the class together with the relevant differentia.
The differentia tells us what marks out instances of the defined class within the wider parent class, as in:
Plasma membrane is a cell part [immediate parent] that surrounds the cytoplasm [differentia]
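The parent-plus-differentia pattern can be sketched as a small record type. This is only an illustration: the `TermDef` class and `render` function are hypothetical names, and treating "cell part" as a primitive is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TermDef:
    """A class defined by its immediate parent plus a differentia."""
    name: str
    parent: Optional[str]        # None marks an undefinable top-level primitive
    differentia: Optional[str]   # what marks this class out within the parent


def render(term: TermDef) -> str:
    """Render the definition in the 'X is a PARENT that DIFFERENTIA' form."""
    if term.parent is None:
        return f"{term.name}: primitive (no definition)"
    return f"{term.name} is a {term.parent} that {term.differentia}"


# "cell part" is treated as a primitive here only for the sake of the example.
cell_part = TermDef("cell part", None, None)
plasma_membrane = TermDef("plasma membrane", "cell part", "surrounds the cytoplasm")

print(render(plasma_membrane))
```

Keeping the parent and the differentia as separate fields means a definition can be checked mechanically: every non-primitive term must name exactly one parent and one distinguishing property.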
[Figure: hierarchy of classes, physical object (substance) > organism > animal > mammal > cat > Siamese, with instances shown below the classes; frog is shown as a leaf class]
All members of the class frog share a froggy nature
Content of FMA (Cornelius Rosse)
Anatomical structures: Thorax, Lung, Heart, Cell
Challenge: Duplicate graphical model in symbolic model
Adapted from Bloom & Fawcett: Textbook of Histology, 12th ed., Chapman & Hall, 1994
Content of FMA
Universals or classes: Kinds of anatomical entities
So you want an ontology…
What do you have to do to make/get/use/steal/beg one?
[Flowchart: Why, Survey, then the questions Domain covered? Public? Active? Applied? Community?, with yes/no branches leading to Improve, Develop, or Salvage; Collaborate & Learn]
What you must do
Justify exactly why there is a need
Scope it very, very tightly
Communicate with people
The decisions you must make
What domain does it cover?
Is it privately held?
Is it active?
Is it applied?
Due diligence & background research
Step 1: Learn what is out there
The most comprehensive list is on the OBO site: http://obo.sourceforge.net
Assess ontologies critically and realistically.
Make contact
Ontologies must be shared
Proprietary ontologies
Belief that ownership of the terminology gives the owners a competitive edge
For example, Incyte or Monsanto in the past, SNOMED for non-US.
Data cannot be shared if the ontologies describing the data are not shared.
Don’t reinvent: use the power of combination and collaboration
Pragmatic assessment of an ontology
Is there access to help, e.g. [email protected]?
Does a warm body answer help mail within a ‘reasonable’ time, say 2 working days?
Use it to improve it
Every ontology improves when it is applied to actual data
It improves even more when these data are used to answer questions
There will be fewer problems in the ontology, and more commitment to fixing the remaining ones, when important research data that scientists depend upon is involved
Be very wary of ontologies that have never been applied
Work with that community
To improve (if you found one)
To develop (if you did not)
Getting it right
It is impossible to get it right the 1st (or 2nd, or 3rd, …) time.
What we know about reality is continually growing
Improve
Collaborate and Learn
Implication: “prepare for change”
Establish a mechanism for change.
Use CVS or Subversion.
Changes must be reviewed by experts
Unique Identifiers
Versions
Archives
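One way to picture how unique identifiers, versions and archives fit together is the toy model below. It is a sketch under stated assumptions, not how any real ontology is managed: the `Ontology` class, its methods and the `EX:` identifier prefix are all hypothetical, and in practice the history would live in CVS or Subversion with expert review of each change.

```python
from dataclasses import dataclass


@dataclass
class Term:
    term_id: str          # stable identifier, never reused for another meaning
    name: str
    obsolete: bool = False


class Ontology:
    """Toy model: every change produces a new version and an archived snapshot."""

    def __init__(self):
        self.version = 0
        self.terms = {}
        self.archive = {}   # version number -> {term_id: (name, obsolete)}

    def _release(self):
        self.version += 1
        self.archive[self.version] = {
            tid: (t.name, t.obsolete) for tid, t in self.terms.items()
        }

    def add(self, term_id, name):
        if term_id in self.terms:
            raise ValueError(f"{term_id} is already taken; identifiers are not recycled")
        self.terms[term_id] = Term(term_id, name)
        self._release()

    def obsolete(self, term_id):
        # Terms are flagged, not deleted, so old annotations still resolve.
        self.terms[term_id].obsolete = True
        self._release()


onto = Ontology()
onto.add("EX:0000001", "bud initiation")   # hypothetical identifier
onto.obsolete("EX:0000001")
print(onto.version, onto.archive[1], onto.archive[2])
```

Marking a term obsolete instead of deleting it keeps data annotated with the old identifier interpretable after the ontology changes.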
Ontology development is hard
Have a stake in seeing it work.
Have broad, detailed domain knowledge.
Will engage in vigorous debate without engaging egos.
Will do concrete work and attend frequent working sessions (quarterly), phone conferences (weekly), e-mail correspondence (daily).
2. Principles for Ontology Construction
Why do we need rules for good ontology?
Ontologies must be intelligible to humans (for annotation) and to machines (for reasoning and error-checking)
Unintuitive rules for classification lead to entry errors (problematic links)
Facilitate training of curators
Overcome obstacles to alignment with other ontology and terminology systems
Enhance harvesting of content through automatic reasoning systems
Following basic rules makes more useful ontologies
Aristotle’s categories
This is Aristotle’s list of types of predication, that is, the different ways in which things can be said to be. He identifies 10 mutually exclusive categories.
1. Substance
2. Quantity
3. Quality
4. Relation
5. Location
6. Time
7. Position
8. Possession
9. Doing
10. Undergoing
SNOMED-CT Top Level
Substance
Body Structure
Specimen
Context-Dependent Categories*
Attribute
Finding*
Staging and Scales
Organism
Physical Object
Events
Environments and Geographic Locations
Qualifier Value
Special Concept*
Pharmaceutical and Biological Products
Social Context
Disease
Procedure
Physical Force
Examples of Rules
Don’t confuse instances with universals
Your navel (instance) is not the abstract representation of all navels
Your microarray result is not the abstract representation of all microarray results
The meaning of an ontology should not change when the programming language changes
First Rule: Univocity
Terms (including those describing relations) should have the same meanings on every occasion of use.
In other words, they should refer to the same kinds of instances in reality
Example of univocity problem in case of part_of relation
(Old) Gene Ontology:
‘part_of’ = ‘may be part of’: flagellum part_of cell
‘part_of’ = ‘is at times part of’: replication fork part_of the nucleoplasm
‘part_of’ = ‘is included as a sub-list in’
Second Rule: Positivity
Complements of classes are not themselves classes.
Terms such as ‘non-mammal’, or ‘non-frog’, or ‘non-membrane’ do not designate genuine classes.
Third Rule: Objectivity
Which classes exist is not a function of our biological knowledge.
Terms such as ‘unknown’ or ‘unclassified’ do not designate biological natural kinds.
Fourth Rule: Single Inheritance
No class in a classificatory hierarchy should have more than one is_a parent on the immediate higher level
I.e. no diamonds
[Figure: class A with two is_a parents, B (via is_a1) and C (via is_a2)]
Following the single inheritance rule
The position of a term within the hierarchy enriches its own definition by incorporating automatically the definitions of all the terms above it.
The entire information content of the term hierarchy can be translated very cleanly into a computer representation
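As a rough sketch of why single inheritance translates cleanly into a computer representation: with at most one is_a parent per class, the hierarchy is a plain child-to-parent map, and everything a term inherits is found by walking one chain upward. The function names are hypothetical, the hierarchy reuses the cat example, and the extra "pet" parent is invented purely to trigger the check. The walk assumes the hierarchy has no cycles.

```python
from collections import defaultdict

# (child, parent) is_a assertions, using the cat hierarchy as a toy example
IS_A = [
    ("Siamese", "cat"),
    ("cat", "mammal"),
    ("mammal", "animal"),
    ("animal", "organism"),
]


def multiple_parents(is_a_edges):
    """Return classes that break the rule by having more than one is_a parent."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


def lineage(term, is_a_edges):
    """Walk up the single is_a chain; meaningful only if multiple_parents() is empty."""
    parent_of = dict(is_a_edges)
    chain = [term]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


print(multiple_parents(IS_A))
print(lineage("Siamese", IS_A))
# Adding a second, hypothetical parent for "cat" creates a violation:
print(multiple_parents(IS_A + [("cat", "pet")]))
```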
Problems with multiple inheritance
  B       C
is_a1   is_a2
      A
‘is_a’ no longer univocal
Fifth Rule: Clarity of Text Definitions
The terms used in a definition should be simpler (more intelligible) than the term to be defined
otherwise the definition provides no assistance to human understanding
Machines can cope with the full formal representation (they don’t need the text)
Sixth Rule: Basis in Reality
When building or maintaining an ontology, always think carefully about how classes (types, kinds, species) relate to instances in reality
Axioms governing instances
Every class has at least one instance (exceptions will occur at top levels)
Each child class has a smaller collection of instances than its parent class
Axiom: Every parent class has at least two children
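The three axioms can be checked mechanically once classes are linked to their instances. The sketch below assumes a toy data layout (a child-to-parent map and a class-to-instances map) and hypothetical names throughout; it ignores the top-level exceptions mentioned above.

```python
PARENT_OF = {"cat": "mammal", "dog": "mammal"}   # child class -> parent class
INSTANCES = {                                     # class -> known instances
    "mammal": {"felix", "tom", "rex"},
    "cat": {"felix", "tom"},
    "dog": {"rex"},
}


def check_axioms(parent_of, instances):
    """Return a list of human-readable violations of the three axioms."""
    problems = []
    classes = set(parent_of) | set(parent_of.values())
    # Axiom 1: every class has at least one instance.
    for cls in sorted(classes):
        if not instances.get(cls):
            problems.append(f"{cls}: no instances")
    # Axiom 2: a child's instances form a proper subset of its parent's.
    for child, parent in sorted(parent_of.items()):
        if not instances.get(child, set()) < instances.get(parent, set()):
            problems.append(f"{child}: instances are not a proper subset of {parent}")
    # Axiom 3: every parent has at least two children.
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    for parent, kids in sorted(children.items()):
        if len(kids) < 2:
            problems.append(f"{parent}: only one child")
    return problems


print(check_axioms(PARENT_OF, INSTANCES))
print(check_axioms({"cat": "mammal"}, {"mammal": {"felix"}, "cat": {"felix"}}))
```

The second call shows how the last two axioms interact: a parent with a single child that covers all of its instances adds no information to the hierarchy.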
The reason that rules are important:
Interoperability
Ontologies should work together
Avoid redundancy in ontology building
Support reuse
Ontologies should be capable of being used by other ontologies (cumulation)
The problem of ontology re-use
SNOMED, MeSH, UMLS, NCIT, HL7-RIM, …
None of these have clearly defined relations
Still remain too much at the level of TERMINOLOGY
Not based on a common set of rules
Not based on a common set of relations
An example of unclear relationship use
A is_a B: ‘A’ is more specific in meaning than ‘B’
HL7-RIM: Individual Allele is_a Act of Observation
cancer documentation is_a cancer
disease prevention is_a disease
How to define A is_a B
A is_a B = def.
• A and B are names of universals (natural kinds, types) in reality
• all instances of A are as a matter of biological science also instances of B
Benefits of well-defined relationships
If the relations in an ontology are well-defined, then reasoning can cascade from one relational assertion (A R1 B) to the next (B R2 C). Relations used in ontologies thus far have not been well defined in this sense.
A query for all DNA binding proteins should also find all transcription factor proteins, because transcription factor is_a DNA binding protein
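The DNA binding protein query can be sketched as a walk up the is_a chain. This is a minimal illustration: the gene names are hypothetical, and the extra assertion "DNA binding protein is_a protein" is an assumption added only to give the chain a second step.

```python
# child -> parent is_a assertions
IS_A = {
    "transcription factor": "DNA binding protein",
    "DNA binding protein": "protein",   # assumed for illustration
}

# hypothetical gene products and the most specific term each is annotated to
ANNOTATIONS = {
    "geneA": "transcription factor",
    "geneB": "DNA binding protein",
    "geneC": "protein",
}


def ancestors(term):
    """All terms reachable from `term` by following is_a upward."""
    found = []
    while term in IS_A:
        term = IS_A[term]
        found.append(term)
    return found


def find(term):
    """Gene products annotated to `term` or to any of its is_a descendants."""
    return sorted(
        gene for gene, annotated in ANNOTATIONS.items()
        if annotated == term or term in ancestors(annotated)
    )


print(find("DNA binding protein"))
```

The query returns geneA as well as geneB only because is_a has one fixed meaning; if the relation meant something different on each edge, chaining the assertions would not be safe.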
Biomedical data integration / interoperability
Will never be achieved through integration of meanings or concepts
The problem: different user communities use different concepts
What is really needed is a well-defined, commonly used set of relationships
Seventh Rule: Distinguish Universals and Instances
A good ontology must distinguish clearly between universals (types, kinds, classes) and instances (tokens, individuals, particulars)
Why distinguish classes from instances?
What holds on the level of instances may not hold on the level of universals
For example, my definition of an “adjacent_to” relation requires that it work in either direction
(This particular) nucleus adjacent_to (this particular) cytoplasm: always true
Cytoplasm adjacent_to nucleus: not always true
Using relations
Between classes: is_a, part_of, ...
Between an instance and a class: this explosion instance_of the class explosion
Between instances: Mary’s heart part_of Mary
Relations must be defined to always work
Defining the part_of relation can be a problem
part_of as a relation between classes versus part_of as a relation between instances
nucleus part_of cell (classes)
your heart part_of you (instances)
testis part_of human being ?
heart part_of human being ?
human being has_part human testis ?
Similar considerations are required to clearly define nearly all relations
A causes B
A is_located in B
A is_adjacent_to B
A derives_from B: Zygote derives_from ovum, sperm
A transformation_of B: Adult transformation_of child
The Rules
1. Univocity: Terms should have the same meanings on every occasion of use
2. Positivity: Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.
3. Objectivity: Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
4. Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level
5. Intelligibility of Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined
6. Basis in Reality: When building or maintaining an ontology, always think carefully about how classes relate to instances in reality
7. Distinguish Classes and Instances
Some rules are Rules of Thumb
The world is full of difficult trade-offs
The benefits of formal (logical and ontological) rigor need to be balanced
Against the constraints of computer tractability,
Against the needs of biomedical practitioners.
BUT do the very best you can!
3. Case Studies from the GO
http://www.geneontology.org
How has GO dealt with some specific aspects of ontology development?
Univocity
Positivity
Objectivity
Definitions
Formal definitions
Written definitions
Ontology Re-use (Alignment)
The Challenge of Univocity: People call the same thing by different names
Tactile sense, Taction, Tactition: ?
Univocity: GO uses one term and many characterized synonyms
perception of touch ; GO:0050975
Synonyms: Tactile sense, Taction, Tactition
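The one-term-many-synonyms approach amounts to an index from every label to a single identifier. A minimal sketch, assuming a simple dictionary layout and hypothetical function names (this is not GO's file format):

```python
TERMS = {
    "GO:0050975": {
        "name": "perception of touch",
        "synonyms": ["tactile sense", "taction", "tactition"],
    },
}


def build_index(terms):
    """Map every name and synonym (case-insensitively) to exactly one term ID."""
    index = {}
    for term_id, record in terms.items():
        for label in [record["name"], *record["synonyms"]]:
            key = label.lower()
            if key in index and index[key] != term_id:
                raise ValueError(f"'{label}' would point at two different terms")
            index[key] = term_id
    return index


INDEX = build_index(TERMS)


def resolve(label):
    """Return the term ID for a label, or None if the label is unknown."""
    return INDEX.get(label.strip().lower())


print(resolve("Taction"))
```

Whichever name a user types, the data ends up attached to the same term, so queries do not depend on vocabulary preferences.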
The Challenge of Univocity: People use the same words to describe different things
[Figure: three different images, each labeled “= bud initiation”]
Bud initiation? How is a computer to know?
Univocity: GO adds “sensu” descriptors to discriminate among organisms
= bud initiation sensu Metazoa
= bud initiation sensu Saccharomyces
= bud initiation sensu Viridiplantae
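The effect of the sensu descriptors can be sketched as a lookup that refuses to guess when a bare label is ambiguous. The table layout and function names below are hypothetical; only the three term names come from the slide.

```python
SENSU_TERMS = {
    ("bud initiation", "Metazoa"): "bud initiation sensu Metazoa",
    ("bud initiation", "Saccharomyces"): "bud initiation sensu Saccharomyces",
    ("bud initiation", "Viridiplantae"): "bud initiation sensu Viridiplantae",
}


def candidates(label):
    """All terms that share the same bare label."""
    return sorted(name for (base, _taxon), name in SENSU_TERMS.items() if base == label)


def resolve(label, taxon=None):
    """Pick the term for a label; a taxon is required when the label is ambiguous."""
    if taxon is not None:
        return SENSU_TERMS[(label, taxon)]
    matches = candidates(label)
    if len(matches) == 1:
        return matches[0]
    raise LookupError(f"'{label}' is ambiguous: {matches}")


print(resolve("bud initiation", "Saccharomyces"))
print(candidates("bud initiation"))
```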
The Challenge of Positivity
Some organelles are membrane-bound. A centrosome is not a membrane-bound organelle, but it still may be considered an organelle.
The Challenge of Positivity: Sometimes absence is a distinction in a Biologist’s mind
non-membrane-bound organelle ; GO:0043228
membrane-bound organelle ; GO:0043227
Positivity
Note the logical difference between “non-membrane-bound organelle” and “not a membrane-bound organelle”
The latter includes everything that is not a membrane-bound organelle!
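The logical difference can be made concrete with sets. In this sketch the small universe of entities and the two flags per entity are assumptions chosen for illustration; hemoglobin stands in for anything that is not an organelle at all.

```python
# name -> (is_organelle, is_membrane_bound); a tiny illustrative universe
ENTITIES = {
    "mitochondrion": (True, True),
    "nucleus": (True, True),
    "centrosome": (True, False),
    "ribosome": (True, False),
    "hemoglobin": (False, False),   # not an organelle at all
}

membrane_bound_organelle = {
    name for name, (organelle, bound) in ENTITIES.items() if organelle and bound
}

# A genuine class: organelles that lack a bounding membrane.
non_membrane_bound_organelle = {
    name for name, (organelle, bound) in ENTITIES.items() if organelle and not bound
}

# A mere complement: everything that is not in membrane_bound_organelle.
not_a_membrane_bound_organelle = set(ENTITIES) - membrane_bound_organelle

print(sorted(non_membrane_bound_organelle))
print(sorted(not_a_membrane_bound_organelle))
```

The complement sweeps in hemoglobin along with the centrosome, which is why it does not work as a class in the ontology.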
The Challenge of Objectivity: Database users want to know if we don’t know anything (Exhaustiveness with respect to knowledge)
We don’t know anything about a gene product with respect to these
We don’t know anything about the ligand that binds this type of GPCR
Objectivity
How can we use GO to annotate gene products when we know that we don’t have any information about them?
Currently GO has terms in each ontology to describe unknown
An alternative might be to annotate genes to root nodes and use an evidence code to describe that we have no data.
Similar strategies could be used for things like receptors where the ligand is unknown.
GPCRs with unknown ligands
We could annotate to this
GO Definitions
A definition written by a biologist: necessary & sufficient conditions; written definition (not computable)
Graph structure: necessary conditions; formal (computable)
Relationships and definitions
Important considerations:
Placement in the graph: selecting parents
Appropriate relationships to different parents
True path violation
True path violation: What is it?
“...the path from a child term all the way up to its top-level parent(s) must always be true”.
[Figure: nucleus, chromosome and mitochondrial chromosome, linked by is_a and part_of relationships]
True path violation: What is it?
[Figure: nucleus, chromosome, nuclear chromosome and mitochondrial chromosome, linked by is_a and part_of relationships]
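A true path violation shows up when the relationships a term inherits are followed to the top. The sketch below assumes the edges are "chromosome part_of nucleus" and "mitochondrial chromosome is_a chromosome" in the first graph, and that the repaired graph moves part_of down to "nuclear chromosome"; these edge assignments are a reading of the figures, and the function name is hypothetical.

```python
def inferred_part_of(term, is_a, part_of):
    """Everything `term` is implied to be part of, via its is_a and part_of ancestors."""
    wholes, seen, frontier = set(), set(), [term]
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        for whole in part_of.get(current, ()):
            wholes.add(whole)
            frontier.append(whole)          # part_of is followed transitively
        frontier.extend(is_a.get(current, ()))
    return wholes


# First graph: the part_of link sits on the general term.
broken_is_a = {"mitochondrial chromosome": ["chromosome"]}
broken_part_of = {"chromosome": ["nucleus"]}

# Repaired graph: the part_of link sits on the more specific term.
fixed_is_a = {
    "nuclear chromosome": ["chromosome"],
    "mitochondrial chromosome": ["chromosome"],
}
fixed_part_of = {"nuclear chromosome": ["nucleus"]}

print(inferred_part_of("mitochondrial chromosome", broken_is_a, broken_part_of))
print(inferred_part_of("mitochondrial chromosome", fixed_is_a, fixed_part_of))
print(inferred_part_of("nuclear chromosome", fixed_is_a, fixed_part_of))
```

In the first graph the mitochondrial chromosome is inferred to be part of the nucleus, which is false; a curator reviewing the inferred set for each new child term can catch this before it reaches annotations.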
The Importance of synonyms: is tRNA a function?
Molecular_function
Triplet codon amino acid adaptor activity
GO Definition: Mediates the insertion of an amino acid at the correct point in the sequence of a nascent polypeptide chain during protein synthesis.
Synonym: tRNA
Ontology integration
One of the current goals of GO is integration
References to Cell Types in GO | Cell Types in the Cell Ontology
cone cell fate commitment | retinal_cone_cell
keratinocyte differentiation | keratinocyte
adipocyte differentiation | fat_cell
dendritic cell activation | dendritic_cell
lymphocyte proliferation | lymphocyte
T-cell homeostasis | T_lymphocyte
garland cell differentiation | garland_cell
heterocyst cell differentiation | heterocyst
We can integrate the GO with other ontologies
Chemical ontologies: 3,4-dihydroxy-2-butanone-4-phosphate synthase activity
Anatomy ontologies: metanephros development
GO itself: mitochondrial inner membrane peptidase activity
Nota bene: some time and effort will be required
Building Ontology
Improve
Collaborate and Learn
Applied Ontology: a summary
Dedicated editors
Practice good ontological hygiene
Engage the community
Reward compliance and get the ontology into use
Plan for change over time
KISS: Concentrate on what you can definitely agree upon: the steps you can take with certainty.
4. Case Studies for group discussion
mitosis and meiosis
It's been a full lunar cycle since we last talked about this on the mailing list, and I would like to draw everyone's attention once again to the exciting topics of chromosome segregation, nuclear division and cell division. The basic problem is the multiplicity of meanings attached to 'mitosis'. The word is used in the literature and colloquially to represent everything from chromosome segregation up to a full round of nuclear and cell division, and there is no consensus on how to define it in scientific or general dictionaries (check www.onelook.com for proof). To compound the problem, the only process common to all species which undergo 'mitosis' is chromosome segregation; not all species undergo nuclear division or cell division during the processes described in the literature as 'mitosis'. In the ontologies, we currently have 'mitosis' defined as chromosome segregation and nuclear division. This is therefore wrong for those species in which there is no nuclear division accompanying chromosome segregation. How are we going to define mitosis?
Events of the mitotic cell cycle that need to be represented: mitotic chromosome segregation mitotic nuclear division mitotic cell division
Only component common to all these is mitotic chromosome segregation.
Structure must be flexible enough to accommodate any of the flavors of 'mitosis', no matter what the species and no matter whether the annotator has read the definition or not.
Backing up assertions
QUESTION: What evidence code is appropriate to use for statements of “common knowledge”?
The current documentation states that TAS may be used as the evidence code for statements of common knowledge. For example, let’s say you have a paper that says that Protein X is an xxxxx, with a direct assay for activity, so you can use IDA for this function term. The paper also makes a mutation in the gene for Protein X and shows that it is involved in process yyyy, so you can use IMP for the process term. But the paper does not have any direct evidence about the localization of Protein X. However, everyone knows that process yyyy occurs in the cytoplasm, so you can annotate protein X to the component term “cytoplasm ; GO:5737” by TAS using a general reference like Biochemistry by Lubert Stryer.
There is not really a traceable statement in Stryer providing evidence that process yyyy occurs in this location in yeast.
SGD feels that it is better to use the newer evidence code IC for these “common knowledge” types of annotations. Thus, if an SGD curator felt that it was reasonable to make the annotation “cytoplasm” based on the knowledge that Protein X has the process annotation yyyy, then the curator could assign the component term “cytoplasm ; GO:5737” using IC and the GO ID of the process term yyyy.
Many of these “common knowledge” statements are not well based in actual experiments conducted on the organism of interest. Early biochemists would often perform experiments with materials that were easy to obtain, e.g. calf thymus, and assume that this accurately represented the situation for another organism, e.g. human. This may or may not be the case.
What is the most appropriate GO term for annotating a response to methylmercury?
"Response to mercury ion" doesn't seem quite right, as it specifically states that the response is "as a result of exposure to mercuric ions (Hg2+)", but the more general-sounding "response to mercury" is a synonym of it. In the publication I am working on, they exposed zebrafish to methylmercury and documented the resulting changes in gene expression.
"Response to mercury ion"
Definition: A change in state or activity of the organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of exposure to mercuric ions (Hg2+).
Synonyms: response to mercuric, response to mercury
Homeobox => DNA binding?
http://www.geneontology.org/email-annotation/annotation-arc/annotation-2005/0208.html
Bloggers and other online groups (e.g. del.icio.us, Flickr [online photo archive], Technorati) have been self-categorizing or 'tagging' web sites and their content using user-defined words and phrases and not an expertly curated vocabulary or ontology. The end result is that a vast amount of content has been indexed using a rich vocabulary of tags (to date, Technorati has over 1.2 billion links tagged with 1.2 million tags).
Whilst this certainly lacks the formal consistency that would be obtained with curated annotation against a standard vocabulary, the quantity of content being categorized far exceeds what could be done by a group of annotators and perhaps is richer because the tags are defined by the users and creators of that content, not by a third party interpreting the material after the fact.
Given the ever increasing quantity of scientific data, the proliferation of online publishing, etc., could scientists tagging their own data with their own terms be the way to go?
How can you recruit and train people, in both logic and biology, given that without a sufficient number of competent personnel the ontology cannot be maintained?
Thanks to NIH and HHMI for funding and support
And to my fantastic colleagues (whose slides these are)
MICHAEL ASHBURNER, BARRY SMITH, DAVID HILL, CORNELIUS ROSSE & CHRIS MUNGALL
P.S. Graphical User Interfaces
Semantics
Common pitfalls
Don’t confuse instances with artifacts of your database representation...
part_of
part_of must be time-indexed for spatial classes
A part_of B is defined as:
Given any instance a and any time t,
if a is an instance of the universal A at t,
then there is some instance b of the universal B
such that a is an instance-level part_of b at t
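The class-level definition can be tested against instance-level facts. The sketch below assumes a toy record of which instance belongs to which class at which time, plus a set of time-indexed instance-level part_of facts; all names are hypothetical. The `class_has_part` function is a mirrored reading added as an assumption, to show why "human being has_part testis" fails even though "testis part_of human being" holds.

```python
# (instance, time) -> class the instance belongs to at that time
INSTANCE_OF = {
    ("mary", 1): "human being",
    ("john", 1): "human being",
    ("marys_heart", 1): "heart",
    ("johns_heart", 1): "heart",
    ("johns_testis", 1): "testis",
}

# (part, whole, time): instance-level part_of facts
PART_OF_AT = {
    ("marys_heart", "mary", 1),
    ("johns_heart", "john", 1),
    ("johns_testis", "john", 1),
}


def class_part_of(a_cls, b_cls):
    """A part_of B: every instance of A at t is part of some instance of B at t."""
    for (a, t), cls in INSTANCE_OF.items():
        if cls != a_cls:
            continue
        wholes = [w for (p, w, tt) in PART_OF_AT if p == a and tt == t]
        if not any(INSTANCE_OF.get((w, t)) == b_cls for w in wholes):
            return False
    return True  # vacuously true if A has no recorded instances


def class_has_part(a_cls, b_cls):
    """Mirrored reading (an assumption): every instance of A at t has some B as part at t."""
    for (a, t), cls in INSTANCE_OF.items():
        if cls != a_cls:
            continue
        parts = [p for (p, w, tt) in PART_OF_AT if w == a and tt == t]
        if not any(INSTANCE_OF.get((p, t)) == b_cls for p in parts):
            return False
    return True


print(class_part_of("testis", "human being"))
print(class_has_part("human being", "testis"))
```

Because the definition quantifies over all instances of A, part_of and has_part at the class level are not simple inverses of one another.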
derives_from
[Figure: instances c and c' of classes C and C' at time t, and instance c1 of class C1 at time t1, along a time axis]
zygote derives_from ovum, sperm
transformation_of
[Figure: the same instance c shown at times t and t1, under classes C and C1, along a time axis]
pre-RNA, mature RNA
adult, child
C2 transformation_of C1 is defined as:
Given any instance c of C2,
c was at some earlier time an instance of C1
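The definition can be checked against a record of which class each instance belonged to over time. A minimal sketch, assuming a hypothetical history table keyed by instance, with made-up ages as time points:

```python
# instance -> list of (time, class) observations
HISTORY = {
    "mary": [(5, "child"), (30, "adult")],
    "john": [(8, "child"), (40, "adult")],
}


def transformation_of(c2, c1, history):
    """C2 transformation_of C1: every instance of C2 was earlier an instance of C1."""
    for timeline in history.values():
        ordered = sorted(timeline)
        for i, (_time, cls) in enumerate(ordered):
            if cls != c2:
                continue
            earlier_classes = [c for _t, c in ordered[:i]]
            if c1 not in earlier_classes:
                return False
    return True


print(transformation_of("adult", "child", HISTORY))
print(transformation_of("child", "adult", HISTORY))
```

Unlike derives_from, the same instance persists across the change; only its class membership differs between the two times.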
[Figure: two examples, embryological development and tumor development, each showing an instance c at times t and t1 under classes C and C1]
Key
In the following discussion:
Classes are in upper case: ‘A’ is the class
Instances are in lower case: ‘a’ is a particular instance
Placement in the graph
Example: Proteasome complex