Making Semantics do Some Work
Keynote talk at Practical Semantic Astronomy (SemAst09), Glasgow, 2009
Robert Stevens
BioHealth Informatics Group
School of Computer Science
University of Manchester
Introduction
• What’s the use of highly axiomatised ontological descriptions?
• Two use cases:
• Classifying instances based on features: New discoveries;
• Building a complex terminology.
• Cost and benefit.
• Conclusions
Protein Classification
• Proteins divided into broad functional classes: “Protein Families”
• Families sub-divided to give family classifications
• Class membership can be determined by “protein features”, such as domains, etc.
• Resources exist for feature detection via primary sequence, but not class membership
• Current Limitation of Automated Tools
• Needs human knowledge to recognise class membership
Finding Domains on a Sequence
A search of the linear sequence of protein tyrosine phosphatase type K – identified 9 functional domains
>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..
Why Classify?
• Classification and curation of a genome is the first step in understanding the processes and functions happening in an organism
• Classification enables comparative genomic studies - what is already known in other organisms
• The similarities and differences between processes and functions in related organisms often provide the greatest insight into the biology
• In silico characterisation is the current bottleneck
Phosphatase Classification
• Diagnostic phosphatase domains/motifs – sufficient for membership of the protein phosphatase superfamily
• Any protein having a phosphatase domain is a member of the phosphatase super-family
• Other motifs determine a protein’s place within the family
• Usually needs a human to recognise that the detected features imply class membership
• Can these be captured in an ontology?
OWL represents classes of instances
[Figure: diagram with classes labelled A, B and C]
Necessity and Sufficiency
• An R2A phosphatase must have a fibronectin domain
• Having a fibronectin domain does not a phosphatase make
• Necessity: what must a class instance have?
• Any protein that has a phosphatase catalytic domain is a phosphatase enzyme
• All phosphatase enzymes have a catalytic domain
• Sufficiency: how is an instance recognised to be a member of a class?
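The difference between the two kinds of condition can be sketched in a few lines of Python. This is only an illustration: the domain names follow the slides, while the two example feature sets and both function names are made up for the purpose.

```python
# Illustrative feature sets; the domain names follow the slides, the proteins are made up.
def is_phosphatase(domains: set[str]) -> bool:
    """Sufficiency: having a phosphatase catalytic domain is enough for membership."""
    return "PhosphataseCatalyticDomain" in domains


def could_be_r2a(domains: set[str]) -> bool:
    """Necessity: an R2A phosphatase must have a fibronectin domain.

    Passing this check does not prove R2A membership; failing it rules it out.
    """
    return "FibronectinDomain" in domains


fibronectin_only = {"FibronectinDomain"}
catalytic_only = {"PhosphataseCatalyticDomain"}

print(is_phosphatase(fibronectin_only))  # False: fibronectin alone is not sufficient
print(is_phosphatase(catalytic_only))    # True: the catalytic domain is sufficient
print(could_be_r2a(catalytic_only))      # False: a necessary feature is missing
```

A necessary condition can only exclude candidates; only a sufficient condition lets a reasoner recognise members.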
Definition of Tyrosine Phosphatase
Class: TyrosineReceptorProteinPhosphatase
EquivalentTo: Protein that
- contains atLeast 1 ProteinTyrosinePhosphataseDomain and
- contains 1 TransmembraneDomain
…there are known knowns; there are things we know we know. We also know there are
known unknowns; that is to say we know there are some things we do not know. But
there are also unknown unknowns -- the ones we don't know we don't know.
Definition for R2A Phosphatase
Class: R2A
EquivalentTo: Protein that
- contains 2 ProteinTyrosinePhosphataseDomain and
- contains 1 TransmembraneDomain and
- contains 4 FibronectinDomain and
- contains 1 ImmunoglobulinDomain and
- contains 1 MAMDomain and
- contains 1 Cadherin-LikeDomain and
- contains only (ProteinTyrosinePhosphataseDomain or TransmembraneDomain or FibronectinDomain or ImmunoglobulinDomain or Cadherin-LikeDomain or MAMDomain)
Automated Reasoning
• An OWL-DL ontology mapped to its DL form as a collection of axioms
• An automated reasoner checks for satisfiability – throws out the inconsistent and infers subsumption
• Defined classes (where there are necessary and sufficient restrictions) enable a reasoner to infer subclass axioms
• Also infer to which class an individual belongs
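As a rough illustration of how subsumption between defined classes can be inferred, the sketch below compares cardinality restrictions structurally. It is not a DL reasoner: the check is sound for this tiny restriction language but incomplete, and reading "contains 1" as "exactly 1" is an assumption about the slide notation. The names DEFINITIONS and is_subsumed_by are hypothetical.

```python
import math

# Each class is a conjunction of cardinality restrictions on "contains":
# domain name -> (minimum, maximum). "contains N" is read as "exactly N",
# which is an assumption about the slide notation.
DEFINITIONS = {
    "TyrosineReceptorProteinPhosphatase": {
        "ProteinTyrosinePhosphataseDomain": (1, math.inf),
        "TransmembraneDomain": (1, 1),
    },
    "R2A": {
        "ProteinTyrosinePhosphataseDomain": (2, 2),
        "TransmembraneDomain": (1, 1),
        "FibronectinDomain": (4, 4),
        "ImmunoglobulinDomain": (1, 1),
        "MAMDomain": (1, 1),
        "Cadherin-LikeDomain": (1, 1),
    },
}


def is_subsumed_by(sub: str, sup: str) -> bool:
    """True if every restriction of `sup` is implied by a restriction of `sub`."""
    sub_def, sup_def = DEFINITIONS[sub], DEFINITIONS[sup]
    for domain, (lo, hi) in sup_def.items():
        if domain not in sub_def:
            return False
        sub_lo, sub_hi = sub_def[domain]
        # The subclass interval must lie inside the superclass interval.
        if sub_lo < lo or sub_hi > hi:
            return False
    return True


print(is_subsumed_by("R2A", "TyrosineReceptorProteinPhosphatase"))  # True
print(is_subsumed_by("TyrosineReceptorProteinPhosphatase", "R2A"))  # False
```

Nobody asserted that R2A is a kind of tyrosine receptor protein phosphatase; the relationship falls out of the two definitions.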
Incremental Addition of Protein Functional Domains
[Figure: domain labels: Phosphatase catalytic, Cadherin-like, Immunoglobulin, MAM domain, Cellular retinaldehyde, Adhesion recognition, Transmembrane, Fibronectin III, Glycosylation]
Classification of the Classical Tyrosine Phosphatases
What is the Ontology Telling Us?
• Each class of phosphatase defined in terms of domain composition
• We know the characteristics by which an individual protein can be recognised to be a member of a particular class of phosphatase
• We have this knowledge in a computational form
• If we had protein instances described in terms of the ontology, we could classify those individual proteins
• A catalogue of phosphatases
Description of an Instance of a Protein
Individual: P21592
Types: Protein,
hasDomain 2 ProteinTyrosinePhosphataseDomain,
hasDomain 1 TransmembraneDomain,
hasDomain 4 FibronectinDomain,
hasDomain 1 ImmunoglobulinDomain,
hasDomain 1 MAMDomain,
hasDomain 1 Cadherin-LikeDomain
Instance: P21592
TypeOf: Protein that
Fact: hasDomain 2 ProteinTyrosinePhosphataseDomain and
Fact: hasDomain 1 TransmembraneDomain and
Fact: hasDomain 4 FibronectinDomain and
Fact: hasDomain 1 ImmunoglobulinDomain and
Fact: hasDomain 1 MAMDomain and
Fact: hasDomain 1 Cadherin-LikeDomain
Tyrosine Phosphatase
(containsDomain some TransmembraneDomain) and
(containsDomain at least 1 ProteinTyrosinePhosphataseDomain)
R2A Phosphatase
(containsDomain some MAMDomain) and
(containsDomain some ProteinTyrosineCatalyticDomain or ImmunoglobulinDomain) and
(containsDomain some FibronectinDomain or FibronectinTypeIIIFoldDomain) and
(containsDomain exactly 2 ProteinTyrosinePhosphataseDomain)
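Matching an individual against class definitions can be sketched as follows. This is a simplification under stated assumptions: the "or" alternatives in the R2A definition are left out, the detected domain list is treated as complete (an OWL reasoner works under the open-world assumption and would need that stated explicitly), and the names CLASSES, satisfies and classify are hypothetical.

```python
from collections import Counter

# Domain counts for the individual described on the slide.
P21592 = Counter({
    "ProteinTyrosinePhosphataseDomain": 2,
    "TransmembraneDomain": 1,
    "FibronectinDomain": 4,
    "ImmunoglobulinDomain": 1,
    "MAMDomain": 1,
    "Cadherin-LikeDomain": 1,
})

# Class definitions as (domain, minimum, maximum or None) restrictions.
# Simplified from the slide: the "or" alternatives are omitted.
CLASSES = {
    "TyrosinePhosphatase": [
        ("TransmembraneDomain", 1, None),
        ("ProteinTyrosinePhosphataseDomain", 1, None),
    ],
    "R2APhosphatase": [
        ("MAMDomain", 1, None),
        ("FibronectinDomain", 1, None),
        ("ProteinTyrosinePhosphataseDomain", 2, 2),
    ],
}


def satisfies(domains: Counter, restrictions) -> bool:
    """Check every cardinality restriction against the observed domain counts."""
    for domain, lo, hi in restrictions:
        n = domains[domain]  # a Counter returns 0 for domains that were not detected
        if n < lo or (hi is not None and n > hi):
            return False
    return True


def classify(domains: Counter) -> list[str]:
    """Return every class whose definition the individual satisfies."""
    return [name for name, r in CLASSES.items() if satisfies(domains, r)]


print(classify(P21592))  # ['TyrosinePhosphatase', 'R2APhosphatase']
```

The individual is recognised as a member of both the general and the specific class, which is the behaviour the instance store and reasoner provide at scale.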
Classifying Proteins
>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..
[Figure: pipeline diagram with labels InterPro, Instance Store, Reasoner, Translate, Codify]
So Far…..
• Human phosphatases have been classified using the system
• The ontology classification performed as well as expert classification
• The ontology system refined classification
- DUSC contains zinc finger domain: characterised and conserved, but not in classification
- DUSA contains a disintegrin domain: previously uncharacterised, evolutionarily conserved
• A new kind of phosphatase?
Aspergillus fumigatus
• Phosphatase complement very different from human: >100 human, <50 A. fumigatus
• Whole subfamilies ‘missing’
- Different fungi-specific phosphorylation pathways?
- No requirement for tissue-specific variations?
• Novel serine/threonine phosphatase with homeobox
- Conserved in Aspergillus and closely related species, but not in any other
• Again, a new phosphatase?
Generic Technique
• Feature detection
• Categories defined in terms of those features
• Produce catalogue of what you currently know
• Highlight cases that don’t match current knowledge
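The four steps above can be sketched as a small triage loop. Everything here is illustrative: the subfamily names, their expected feature sets and the three "detected" proteins are made up, and a real system would use an ontology and a reasoner rather than exact set comparison.

```python
DIAGNOSTIC = "PhosphataseCatalyticDomain"

# Step 2: known categories, each as the exact feature set expected (hypothetical).
KNOWN_SUBFAMILIES = {
    "SubfamilyA": frozenset({DIAGNOSTIC, "TransmembraneDomain"}),
    "SubfamilyB": frozenset({DIAGNOSTIC}),
}


def triage(features: set[str]) -> str:
    """Place a feature set in a known category, or flag it for review."""
    if DIAGNOSTIC not in features:
        return "outside superfamily"
    for name, expected in KNOWN_SUBFAMILIES.items():
        if features == expected:
            return name
    # In the superfamily, but matching no known definition: a candidate discovery.
    return "unmatched: review"


# Step 1: output of a feature detector (made-up examples).
detected = {
    "protein1": {DIAGNOSTIC, "TransmembraneDomain"},
    "protein2": {DIAGNOSTIC, "DisintegrinDomain"},
    "protein3": {"FibronectinDomain"},
}

# Steps 3 and 4: build the catalogue and surface what does not fit.
catalogue = {pid: triage(features) for pid, features in detected.items()}
for pid, label in catalogue.items():
    print(pid, label)
```

The interesting output is the "unmatched" bucket: members of the superfamily that current knowledge cannot place.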
The Cell type Ontology
• Some 880 terms
• Describing cell function, lineage, developmental stage, ploidy, secretion, species,…
• Not explicitly classified according to anatomy
• Uses is-a and developsFrom
• Used to describe cell types used in experiments
OBO Cell Type Ontology
Issues with Current CTO
• History: A need was seen and a few days were spent “lashing” together an ontology by hand
• Contains lots of knowledge
• Asserted multiple inheritance: Humans will make slips and it is difficult
• Some biological mistakes
• All the knowledge is within the “is-a” relationships and implicit in the cell names
CTO Axes of Classification
• Histology: What cells look like
• Lineage: Whence a given cell develops
• Ploidy: How many sets of chromosomes in a cell
• Nucleation: How many nuclei
• Secretion & accumulation: What chemicals a cell secretes or accumulates
• Function: What does the cell do
• Location: In anatomy
• Species: In what taxa does the cell exist
• And some others
Implicit Knowledge
• Anatomy: muscle cell; red blood cell
• Maturity: immature t-lymphocyte
• Cell surface protein: CD45 positive lymphocyte
• Size
• Shape
Problems
• Tangles
• Hard to maintain
• Difficult to add a new cell
• Inflexible queries: What about hormone secreting mesodermal cells?
• Information hidden inside term names
A Tangled Ontology of Cars
Describing a Big Blue Ford Car
Class: BigBlueFordCar
SubClassOf: Car
that hasColour some Blue
and hasSize some Big
and hasManufacturer some Ford
Modules
• Choose a primary axis: In this case Vehicle
• Other axes are represented in separate modules (Colour, Size (qualities) and manufacturer)
• Represent other aspects of classes through restrictions
• (Spot the ontological howler in this toy example)
Definition of a Red Car
Class: RedCar
EquivalentTo: Car
that hasColour some Red
• Any car that has the colour red is recognised to be a member of the class RedCar
• The reasoner works it all out and builds the hierarchy for you
Normalisation
• This technique of “pulling” apart tangled ontologies is “normalisation”
• Makes for cleaner modelling
• Makes for re-usable components
• The reasoner builds the taxonomy “completely”
• A new car (e.g., a yellow Saab) is described and it just appears in the right place
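A minimal sketch of the normalisation idea, assuming each class is reduced to a set of property/value restrictions. Only BigBlueFordCar and RedCar come from the slides; YellowSaab, SmallRedFordCar, YellowCar, FordCar and the function name inferred_parents are hypothetical additions for illustration.

```python
# Described (primitive) classes: restrictions asserted along separate axes.
DESCRIBED = {
    "BigBlueFordCar": {"hasColour": "Blue", "hasSize": "Big", "hasManufacturer": "Ford"},
    "SmallRedFordCar": {"hasColour": "Red", "hasSize": "Small", "hasManufacturer": "Ford"},
    "YellowSaab": {"hasColour": "Yellow", "hasManufacturer": "Saab"},
}

# Defined classes: necessary and sufficient conditions on a Car.
DEFINED = {
    "RedCar": {"hasColour": "Red"},
    "YellowCar": {"hasColour": "Yellow"},
    "FordCar": {"hasManufacturer": "Ford"},
}


def inferred_parents(description: dict[str, str]) -> list[str]:
    """Defined classes whose conditions are all met by the description."""
    return sorted(
        name
        for name, condition in DEFINED.items()
        if condition.items() <= description.items()
    )


for name, description in DESCRIBED.items():
    print(name, "->", inferred_parents(description))
```

Nobody asserts that SmallRedFordCar is both a RedCar and a FordCar; the multiple inheritance is computed from the descriptions, so adding YellowSaab needs no manual edits to the hierarchy.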
What We Did
• Examined CTO
• Chose primary axis of classification
• All other axes added as restrictions on class membership
• Describe cells
• Build ontology
• Use reasoner
Ontologies Used
[Figure: ontologies and example terms: CTO Ontology, PATO Ontology, GO Biological Process, GO Cellular Component, Species Taxonomy, Anatomy, Nucleation, Morphology, Size, Ploidy, Muscle Contraction, Secretion, Bacillus anthracis str. Ames, Chloroplast, Cell Membrane, Epithelium, Kidney]
Mammalian Red Blood Cell
Class: RedBloodCell
SubClassOf: Cell
That hasNucleation some Anucleate
and participatesIn some OxygenTransport
and existsIn some mammalia
and part_of some BloodTissue
and developsFrom some Reticulocyte
Mesodermal Lineage Cells
Class: MesodermalLineageCell
EquivalentTo: Cell
That developsFrom some MesodermalCell
(developsFrom is transitive)
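The effect of making developsFrom transitive can be sketched by following direct links to their closure. The RedBloodCell to Reticulocyte link is from the slides; the rest of the lineage chain below is an illustrative assumption, not curated biology, and the names DEVELOPS_FROM and ancestors are hypothetical.

```python
# Direct developsFrom links between cell classes.
DEVELOPS_FROM = {
    "RedBloodCell": {"Reticulocyte"},
    "Reticulocyte": {"Erythroblast"},
    "Erythroblast": {"HaematopoieticStemCell"},
    "HaematopoieticStemCell": {"MesodermalCell"},
    "Neuron": {"Neuroblast"},
    "Neuroblast": {"EctodermalCell"},
}


def ancestors(cell: str) -> set[str]:
    """All cells reachable by following developsFrom one or more times."""
    seen: set[str] = set()
    stack = list(DEVELOPS_FROM.get(cell, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(DEVELOPS_FROM.get(current, ()))
    return seen


# MesodermalLineageCell: any cell that developsFrom some MesodermalCell.
mesodermal_lineage = sorted(
    cell for cell in DEVELOPS_FROM if "MesodermalCell" in ancestors(cell)
)
print(mesodermal_lineage)
# ['Erythroblast', 'HaematopoieticStemCell', 'RedBloodCell', 'Reticulocyte']
```

Only the direct links are asserted; membership of the lineage class follows from transitivity.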
Spreadsheet Workflow
[Figure: workflow diagram with labels Spreadsheet, CVS, OPPL, OWL Ontology, Reasoned Ontology]
Secreting Cells
Class: EpinephrineSecretingCell
SubClassOf: Cell
that belongs_to_line some Somatic
and has_nucleation some mononucleate
and has_ploidy some diploid
and potentiality some TerminallyDifferentiated
and participates_in some EpinephrineBiosyntheticProcess
and participates_in some EpinephrineSecretion
Class: ProlactinSecretingCell
SubClassOf: Cell
that belongs_to_line some Somatic
and has_nucleation some mononucleate
and has_ploidy some diploid
and potentiality some TerminallyDifferentiated
and participates_in some PeptideHormoneSecretion
and participates_in some ProlactinSecretion
Defined Cells
Class: SecretoryCell
EquivalentTo: Cell
that participates_in some (secretion or
(part_of some secretion))
Class: EndocrineCell
EquivalentTo: Cell
that participates_in some (EndocrineProcess or
(part_of some EndocrineProcess))
Asserted Hierarchy
Inferred Hierarchy
What We Found
• More subsumption relationships
• The “is-a” hierarchy is complete
• Explicitness made us ask questions
• Found bad structure
• Can just slip in a new cell
• Can make arbitrary queries based on any of the types of axis
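Once every axis is explicit, a query such as "hormone secreting mesodermal cells" becomes a filter over descriptions. The sketch below assumes hand-written descriptions with illustrative values; the cell entries, axis names and the query helper are hypothetical, and a real query would be a class expression answered by the reasoner.

```python
# Illustrative cell descriptions along two axes: lineage and function.
CELLS = {
    "CortisolSecretingCell": {
        "develops_from": {"MesodermalCell"},
        "participates_in": {"HormoneSecretion"},
    },
    "EpinephrineSecretingCell": {
        "develops_from": {"EctodermalCell"},
        "participates_in": {"HormoneSecretion"},
    },
    "SkeletalMuscleCell": {
        "develops_from": {"MesodermalCell"},
        "participates_in": {"MuscleContraction"},
    },
}


def query(**required: str) -> list[str]:
    """Cells whose description contains every requested axis value."""
    return sorted(
        name
        for name, description in CELLS.items()
        if all(value in description.get(axis, set()) for axis, value in required.items())
    )


print(query(develops_from="MesodermalCell", participates_in="HormoneSecretion"))
# ['CortisolSecretingCell']
```

With the knowledge hidden in term names and a hand-built is-a hierarchy, the same question could not be asked without adding a new class by hand.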
Conclusions
• Can use strict semantics and automated reasoning to build structurally sound ontologies
• Can catalogue instances and make discoveries
• If an object can be recognised by its features and features can be computationally generated, classification can be automated
• High cost and high benefit
Acknowledgements
• Katy Wolstencroft did the protein phosphatase work as part of her Ph.D.
• The work on the cell type ontology was undertaken by members of the EPSRC OntoGenesis Network
• All the ontology work at Manchester relies on the support and input of the wider BioHealth and Information Management Groups