making semantics do some work

Making Semantics do Some Work Robert Stevens BioHealth Informatics Group School of Computer Science University of Manchester [email protected]


Post on 21-May-2015


DESCRIPTION

Keynote talk at Practical Semantic Astronomy (SemAst09), Glasgow, 2009

TRANSCRIPT

Page 1: Making Semantics do Some Work

Making Semantics do Some Work

Robert Stevens

BioHealth Informatics Group

School of Computer Science

University of Manchester

[email protected]

Page 2: Making Semantics do Some Work

Introduction

• What’s the use of highly axiomatised ontological descriptions?

• Two use cases:

• Classifying instances based on features: New discoveries;

• Building a complex terminology.

• Cost and benefit.

• Conclusions

Page 3: Making Semantics do Some Work

Protein Classification

• Proteins divided into broad functional classes: “Protein Families”

• Families sub-divided to give family classifications

• Class membership can be determined by “protein features”, such as domains, etc.

• Resources exist for feature detection via primary sequence – but not class membership

• Current limitation of automated tools

• Needs human knowledge to recognise class membership

Page 4: Making Semantics do Some Work

Finding Domains on a Sequence

A search of the linear sequence of protein tyrosine phosphatase type K – identified 9 functional domains

>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).

MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..

Page 5: Making Semantics do Some Work

Why Classify?

• Classification and curation of a genome is the first step in understanding the processes and functions happening in an organism

• Classification enables comparative genomic studies - what is already known in other organisms

• The similarities and differences between processes and functions in related organisms often provide the greatest insight into the biology

• In silico characterisation is the current bottleneck

Page 6: Making Semantics do Some Work

Phosphatase Classification

• Diagnostic phosphatase domains/motifs – sufficient for membership of the protein phosphatase superfamily

• Any protein having a phosphatase domain is a member of the phosphatase super-family

• Other motifs determine a protein’s place within the family

• Usually needs a human to recognise that the features detected imply class membership

• Can these be captured in an ontology?

Page 7: Making Semantics do Some Work

OWL represents classes of instances

[Figure: diagram of classes A, B and C]

Page 8: Making Semantics do Some Work

Necessity and Sufficiency

• An R2A phosphatase must have a fibronectin domain

• Having a fibronectin domain does not a phosphatase make

• Necessity: what must a class instance have?

• Any protein that has a phosphatase catalytic domain is a phosphatase enzyme

• All phosphatase enzymes have a catalytic domain

• Sufficiency: how is an instance recognised to be a member of a class?
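The difference between a necessary and a sufficient condition can be sketched in a few lines of Python. This is only an illustration: a protein is modelled as a set of domain names, and the names and both helper functions are assumptions made for the example, not part of the ontology.

```python
# Hypothetical feature names; a protein is modelled as the set of domains found on it.
PHOSPHATASE_CATALYTIC = "PhosphataseCatalyticDomain"
FIBRONECTIN = "FibronectinDomain"


def is_phosphatase(domains):
    """Sufficient condition: any protein with a catalytic domain is a phosphatase."""
    return PHOSPHATASE_CATALYTIC in domains


def could_be_r2a(domains):
    """Necessary condition only: lacking fibronectin rules R2A out,
    but having fibronectin does not, by itself, rule R2A in."""
    return FIBRONECTIN in domains


protein_x = {FIBRONECTIN}                         # fibronectin alone
protein_y = {PHOSPHATASE_CATALYTIC, FIBRONECTIN}  # catalytic domain plus fibronectin

for name, protein in [("x", protein_x), ("y", protein_y)]:
    print(name, "phosphatase:", is_phosphatase(protein),
          "R2A still possible:", could_be_r2a(protein))
```

Protein x passes the necessary test for R2A yet is not even a phosphatase, which is the point of the second bullet.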

Page 9: Making Semantics do Some Work

Definition of Tyrosine Phosphatase

Class: TyrosineReceptorProteinPhosphatase

EquivalentTo: Protein that

- contains atLeast 1 ProteinTyrosinePhosphataseDomain and
- contains 1 TransmembraneDomain

Page 10: Making Semantics do Some Work

…there are known knowns; there are things we know we know. We also know there are

known unknowns; that is to say we know there are some things we do not know. But

there are also unknown unknowns -- the ones we don't know we don't know.

Page 11: Making Semantics do Some Work

Definition for R2A Phosphatase

Class: R2A

EquivalentTo: Protein that

- contains 2 ProteinTyrosinePhosphataseDomain and
- contains 1 TransmembraneDomain and
- contains 4 FibronectinDomain and
- contains 1 ImmunoglobulinDomain and
- contains 1 MAMDomain and
- contains 1 Cadherin-LikeDomain and
- contains only ProteinTyrosinePhosphataseDomain or TransmembraneDomain or FibronectinDomain or ImmunoglobulinDomain or Cadherin-LikeDomain or MAMDomain
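The final "contains only" clause is a closure axiom: it rules out any domain that is not listed. A minimal Python sketch of the same test follows, assuming the domains of a protein are given as a complete list. That is a closed-world simplification; an OWL reasoner works under the open-world assumption and would need the closure stated explicitly. The function name and the ZincFingerDomain example are illustrative only.

```python
from collections import Counter

# Domain counts required by the R2A definition on the slide (names normalised).
R2A_REQUIRED = {
    "ProteinTyrosinePhosphataseDomain": 2,
    "TransmembraneDomain": 1,
    "FibronectinDomain": 4,
    "ImmunoglobulinDomain": 1,
    "MAMDomain": 1,
    "Cadherin-LikeDomain": 1,
}


def matches_r2a(domains):
    """True if the domain list has exactly the required counts and nothing else."""
    counts = Counter(domains)
    has_required = all(counts[d] == n for d, n in R2A_REQUIRED.items())
    only_allowed = set(counts) <= set(R2A_REQUIRED)  # the closure axiom
    return has_required and only_allowed


r2a_like = (
    ["ProteinTyrosinePhosphataseDomain"] * 2
    + ["TransmembraneDomain"]
    + ["FibronectinDomain"] * 4
    + ["ImmunoglobulinDomain", "MAMDomain", "Cadherin-LikeDomain"]
)
with_extra = r2a_like + ["ZincFingerDomain"]  # one domain outside the closure

print(matches_r2a(r2a_like))    # all counts match, nothing extra
print(matches_r2a(with_extra))  # rejected by the closure test
```

A protein that fails only the closure test is exactly the kind of case worth a second look, since it carries a feature the current definitions do not account for.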

Page 12: Making Semantics do Some Work

Automated Reasoning

• An OWL-DL ontology is mapped to its DL form as a collection of axioms

• An automated reasoner checks for satisfiability – throws out the inconsistent and infers subsumption

• Defined classes (where there are necessary and sufficient restrictions) enable a reasoner to infer subclass axioms

• A reasoner can also infer to which class an individual belongs
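As a rough illustration of how subclass axioms can be inferred from definitions, the sketch below treats each defined class as a plain conjunction of required domains, so that one class subsumes another when its requirements are a subset of the other's. Real DL reasoners handle far richer constructs than this; the class names and definitions are simplified assumptions.

```python
# Hypothetical, simplified definitions: each class is a conjunction of required domains.
DEFINITIONS = {
    "Phosphatase": {"PhosphataseCatalyticDomain"},
    "ReceptorPhosphatase": {"PhosphataseCatalyticDomain", "TransmembraneDomain"},
    "R2A": {"PhosphataseCatalyticDomain", "TransmembraneDomain", "MAMDomain"},
}


def subsumes(general, specific):
    """general subsumes specific if everything general requires, specific requires too."""
    return DEFINITIONS[general] <= DEFINITIONS[specific]


def inferred_parents(name):
    """Direct superclasses: subsumers with no other subsumer sitting in between."""
    ancestors = [c for c in DEFINITIONS if c != name and subsumes(c, name)]
    return [a for a in ancestors
            if not any(b != a and subsumes(a, b) for b in ancestors)]


for cls in DEFINITIONS:
    print(cls, "->", inferred_parents(cls))
```

Nobody asserted that R2A is a kind of ReceptorPhosphatase; the placement falls out of the definitions.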

Page 13: Making Semantics do Some Work

Incremental Addition of Protein Functional Domains

[Figure: domains added incrementally: Phosphatase catalytic, Cadherin-like, Immunoglobulin, MAM domain, Cellular retinaldehyde, Adhesion recognition, Transmembrane, Fibronectin III, Glycosylation]

Page 14: Making Semantics do Some Work

Classification of the Classical Tyrosine Phosphatases

Page 15: Making Semantics do Some Work

What is the Ontology Telling Us?

• Each class of phosphatase defined in terms of domain composition

• We know the characteristics by which an individual protein can be recognised to be a member of a particular class of phosphatase

• We have this knowledge in a computational form

• If we had protein instances described in terms of the ontology, we could classify those individual proteins

• A catalogue of phosphatases

Page 16: Making Semantics do Some Work

Description of an Instance of a Protein

Individual: P21592

Types: Protein,
hasDomain 2 ProteinTyrosinePhosphataseDomain,
hasDomain 1 TransmembraneDomain,
hasDomain 4 FibronectinDomain,
hasDomain 1 ImmunoglobulinDomain,
hasDomain 1 MAMDomain,
hasDomain 1 Cadherin-LikeDomain

Page 17: Making Semantics do Some Work

Instance: P21592

TypeOf: Protein that
Fact: hasDomain 2 ProteinTyrosinePhosphataseDomain and
Fact: hasDomain 1 TransmembraneDomain and
Fact: hasDomain 4 FibronectinDomain and
Fact: hasDomain 1 ImmunoglobulinDomain and
Fact: hasDomain 1 MAMDomain and
Fact: hasDomain 1 Cadherin-LikeDomain

Tyrosine Phosphatase
(containsDomain some TransmembraneDomain) and
(containsDomain at least 1 ProteinTyrosinePhosphataseDomain)

R2A Phosphatase
(containsDomain some MAMDomain) and
(containsDomain some ProteinTyrosineCatalyticDomain or ImmunoglobulinDomain) and
(containsDomain some FibronectinDomain or FibronectinTypeIIIFoldDomain) and
(containsDomain exactly 2 ProteinTyrosinePhosphataseDomain)
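Matching the instance against the two class expressions on this slide can be sketched as follows. The sketch assumes the listed domain counts are the complete description of P21592 (a closed-world shortcut that a reasoner would not take), and the `some` and `classify` helpers are hypothetical names introduced for the example.

```python
from collections import Counter

# Domain counts for P21592 as given on the slide.
p21592 = Counter({
    "ProteinTyrosinePhosphataseDomain": 2,
    "TransmembraneDomain": 1,
    "FibronectinDomain": 4,
    "ImmunoglobulinDomain": 1,
    "MAMDomain": 1,
    "Cadherin-LikeDomain": 1,
})


def some(counts, *domains):
    """'containsDomain some (A or B ...)': at least one of the listed domains is present."""
    return any(counts[d] >= 1 for d in domains)


# The two class expressions from the slide, written as predicates over domain counts.
CLASSES = {
    "TyrosinePhosphatase": lambda c: (
        some(c, "TransmembraneDomain")
        and c["ProteinTyrosinePhosphataseDomain"] >= 1
    ),
    "R2APhosphatase": lambda c: (
        some(c, "MAMDomain")
        and some(c, "ProteinTyrosineCatalyticDomain", "ImmunoglobulinDomain")
        and some(c, "FibronectinDomain", "FibronectinTypeIIIFoldDomain")
        and c["ProteinTyrosinePhosphataseDomain"] == 2
    ),
}


def classify(counts):
    """Names of every class whose expression the instance satisfies."""
    return sorted(name for name, test in CLASSES.items() if test(counts))


print(classify(p21592))
```

The instance satisfies both expressions, so it is recognised as a tyrosine phosphatase and, more specifically, as an R2A phosphatase.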

Page 18: Making Semantics do Some Work

Classifying Proteins

>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).

MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..

[Figure: classification pipeline with the labels InterPro, Instance Store, Reasoner, Translate, Codify]
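The translate-and-codify step, turning feature-detection hits into an instance description, might look roughly like the sketch below. The hit labels, the mapping table and the `codify` function are all hypothetical; they are not InterPro identifiers or any real tool's API.

```python
from collections import Counter

# Hypothetical mapping from feature-detection labels to ontology domain classes.
HIT_TO_CLASS = {
    "ptp_catalytic": "ProteinTyrosinePhosphataseDomain",
    "transmembrane": "TransmembraneDomain",
    "fn3": "FibronectinDomain",
}


def codify(accession, hits):
    """Return (instance description text, list of hits with no ontology class)."""
    counts = Counter(HIT_TO_CLASS[h] for h in hits if h in HIT_TO_CLASS)
    unmapped = sorted({h for h in hits if h not in HIT_TO_CLASS})
    lines = [f"Individual: {accession}", "Types: Protein"]
    lines += [f"    hasDomain {n} {cls}" for cls, n in sorted(counts.items())]
    return "\n".join(lines), unmapped


# Made-up hit list for an illustrative accession.
frame, unmapped = codify(
    "EXAMPLE1",
    ["ptp_catalytic", "ptp_catalytic", "transmembrane", "fn3", "unknown_motif"],
)
print(frame)
print("unmapped hits:", unmapped)
```

Keeping the unmapped hits rather than dropping them matters, because features the ontology does not yet know about are where refinements tend to come from.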

Page 19: Making Semantics do Some Work

So Far…..

• Human phosphatases have been classified using the system

• The ontology classification performed as well as expert classification

• The ontology system refined classification

- DUSC contains a zinc finger domain: characterised and conserved, but not in classification

- DUSA contains a disintegrin domain: previously uncharacterised, evolutionarily conserved

• A new kind of phosphatase?

Page 20: Making Semantics do Some Work

Aspergillus fumigatus

• Phosphatase complement very different from human: >100 human, <50 A. fumigatus

• Whole subfamilies ‘missing’

- Different fungi-specific phosphorylation pathways?

- No requirement for tissue-specific variations?

• Novel serine/threonine phosphatase with homeobox

- Conserved in Aspergillus and closely related species, but not in any other

• Again, a new phosphatase?

Page 21: Making Semantics do Some Work

Generic Technique

• Feature detection

• Categories defined in terms of those features

• Produce catalogue of what you currently know

• Highlight cases that don’t match current knowledge
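The four steps above can be sketched generically in Python: items carry detected features, categories are defined by required features, and anything carrying a feature that no category mentions is flagged. The item and category names below are illustrative assumptions loosely based on the phosphatase example.

```python
def catalogue(items, categories):
    """items: name -> set of detected features.
    categories: name -> set of features required for membership.
    Returns (memberships per item, unexplained features per item)."""
    recognised = set().union(*categories.values())
    known, unexplained = {}, {}
    for name, features in items.items():
        known[name] = sorted(c for c, req in categories.items() if req <= features)
        extra = features - recognised
        if extra:
            unexplained[name] = sorted(extra)
    return known, unexplained


# Illustrative data only.
items = {
    "DUSC": {"DualSpecificityPhosphataseDomain", "ZincFingerDomain"},
    "PTPK": {"ProteinTyrosinePhosphataseDomain", "TransmembraneDomain"},
}
categories = {
    "DualSpecificityPhosphatase": {"DualSpecificityPhosphataseDomain"},
    "ReceptorTyrosinePhosphatase": {
        "ProteinTyrosinePhosphataseDomain", "TransmembraneDomain",
    },
}

known, unexplained = catalogue(items, categories)
print(known)
print(unexplained)
```

The catalogue records what is currently known, and the unexplained entries are the cases that do not match current knowledge.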

Page 22: Making Semantics do Some Work

The Cell Type Ontology

• Some 880 terms

• Describing cell function, lineage, developmental stage, ploidy, secretion, species,…

• Not explicitly classified according to anatomy

• Uses is-a and developsFrom

• Used to describe cell types used in experiments

Page 23: Making Semantics do Some Work

OBO Cell Type Ontology

Page 24: Making Semantics do Some Work

Issues with Current CTO

• History: A need was seen and a few days were spent “lashing” together an ontology by hand

• Contains lots of knowledge

• Asserted multiple inheritance: Humans will make slips and it is difficult

• Some biological mistakes

• All the knowledge is within the “is-a” relationships and implicit in the cell names

Page 25: Making Semantics do Some Work

CTO Axes of Classification

• Histology: What cells look like

• Lineage: Whence a given cell develops

• Ploidy: How many sets of chromosomes in a cell

• Nucleation: How many nuclei

• Secretion & accumulation: What chemicals a cell secretes or accumulates

• Function: What does the cell do

• Location: In anatomy

• Species: In what taxa does the cell exist

• And some others

Page 26: Making Semantics do Some Work

Implicit Knowledge

• Anatomy: muscle cell; red blood cell

• Maturity: immature t-lymphocyte

• Cell surface protein: CD45 positive lymphocyte

• Size

• Shape

Page 27: Making Semantics do Some Work

Problems

• Tangles

• Hard to maintain

• Difficult to add a new cell

• Inflexible queries: What about hormone secreting mesodermal cells?

• Information hidden inside term names

Page 28: Making Semantics do Some Work

A Tangled Ontology of Cars

Page 29: Making Semantics do Some Work

Describing a Big Blue Ford Car

Class: BigBlueFordCar

SubClassOf: Car

that hasColour some Blue

and hasSize some Big

and hasManufacturer some Ford

Page 30: Making Semantics do Some Work

Modules

• Choose a primary axis: In this case Vehicle

• Other axes are represented in separate modules (Colour, Size (qualities) and manufacturer)

• Represent other aspects of classes through restrictions

• (Spot the ontological howler in this toy example)

Page 31: Making Semantics do Some Work

Definition of a Red Car

Class: RedCar

EquivalentTo: Car

that hasColour some Red

• Any car that has the colour red is recognised to be a member of the class RedCar

• The reasoner works it all out and builds the hierarchy for you

Page 32: Making Semantics do Some Work

Normalisation

• This technique of “pulling” apart tangled ontologies is “normalisation”

• Makes for cleaner modelling

• Makes for re-usable components

• The reasoner builds the taxonomy “completely”

• A new car (e.g., a yellow Saab) is described and it just appears in the right place
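A toy version of normalisation in Python: each car is described once along separate axes, defined classes name a value on one axis, and placement is computed rather than asserted. The class names and the axis dictionary layout are assumptions made for this sketch.

```python
# Each car is described once, along independent axes.
DESCRIPTIONS = {
    "BigBlueFordCar": {"colour": "Blue", "size": "Big", "manufacturer": "Ford"},
    "YellowSaab": {"colour": "Yellow", "manufacturer": "Saab"},
}

# Defined classes: membership follows from the description alone.
DEFINED = {
    "BlueCar": {"colour": "Blue"},
    "FordCar": {"manufacturer": "Ford"},
    "YellowCar": {"colour": "Yellow"},
}


def inferred_superclasses(description):
    """Every defined class whose conditions the description satisfies."""
    return sorted(
        name for name, definition in DEFINED.items()
        if definition.items() <= description.items()
    )


for car, description in DESCRIPTIONS.items():
    print(car, "->", inferred_superclasses(description))
```

Adding a new car means writing one description; no hand-maintained multiple inheritance needs to be touched.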

Page 33: Making Semantics do Some Work

What We Did

• Examined CTO

• Chose primary axis of classification

• All other axes added as restrictions on class membership

• Describe cells

• Build ontology

• Use reasoner

Page 34: Making Semantics do Some Work

Ontologies Used

[Figure: the CTO ontology linked to the ontologies it draws on: PATO Ontology (Nucleation, Morphology, Size, Ploidy); GO Biological Process (Muscle Contraction, Secretion); GO Cellular Component (Chloroplast, Cell Membrane); Species Taxonomy (Bacillus anthracis str. Ames); Anatomy (Epithelium, Kidney)]

Page 35: Making Semantics do Some Work

Mammalian Red Blood Cell

Class: RedBloodCell

SubClassOf: Cell

That hasNucleation some Anucleate

and participatesIn some OxygenTransport

and existsIn some Mammalia

and part_of some BloodTissue

and developsFrom some Reticulocyte

Page 36: Making Semantics do Some Work

Mesodermal Lineage Cells

Class: MesodermalLineageCell

EquivalentTo: Cell

That developsFrom some MesodermalCell

(developsFrom is transitive)
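Because developsFrom is transitive, a cell counts as mesodermal lineage if any chain of developsFrom links reaches a mesodermal cell. A small Python sketch of that closure follows; the chain is deliberately shortened and partly hypothetical, so it should not be read as a statement of actual developmental biology.

```python
# Directly asserted developsFrom links (shortened, illustrative chain).
DEVELOPS_FROM = {
    "RedBloodCell": {"Reticulocyte"},
    "Reticulocyte": {"Erythroblast"},
    "Erythroblast": {"MesodermalCell"},
    "Neuron": {"EctodermalCell"},
}


def develops_from_closure(cell):
    """All cells reachable by following developsFrom one or more times."""
    seen = set()
    stack = list(DEVELOPS_FROM.get(cell, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(DEVELOPS_FROM.get(current, ()))
    return seen


def is_mesodermal_lineage(cell):
    """Mirror of the defined class: developsFrom some MesodermalCell."""
    return "MesodermalCell" in develops_from_closure(cell)


for cell in ["RedBloodCell", "Neuron"]:
    print(cell, is_mesodermal_lineage(cell))
```

Only the direct links are asserted; membership of the defined class follows from transitivity.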

Page 37: Making Semantics do Some Work

Spreadsheet

Page 38: Making Semantics do Some Work

Workflow

Spreadsheet → CSV → OPPL → OWL Ontology → Reasoned Ontology
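The idea behind this workflow, turning one spreadsheet row into one class description, can be sketched with a plain Python template standing in for the OPPL step. The column names and the template are assumptions for illustration; this is not OPPL syntax, and the output only imitates the slide's class frames.

```python
import csv
import io

# A tiny stand-in for the exported spreadsheet (hypothetical column names).
SHEET = """cell,nucleation,ploidy,process
EpinephrineSecretingCell,mononucleate,diploid,EpinephrineSecretion
ProlactinSecretingCell,mononucleate,diploid,ProlactinSecretion
"""


def row_to_frame(row):
    """Fill a fixed template with the values of one spreadsheet row."""
    return "\n".join([
        f"Class: {row['cell']}",
        "    SubClassOf: Cell",
        f"        that has_nucleation some {row['nucleation']}",
        f"        and has_ploidy some {row['ploidy']}",
        f"        and participates_in some {row['process']}",
    ])


frames = [row_to_frame(row) for row in csv.DictReader(io.StringIO(SHEET))]
print("\n\n".join(frames))
```

Domain experts fill in rows, the pattern is applied uniformly, and the reasoner then arranges the resulting classes.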

Page 39: Making Semantics do Some Work

Secreting Cells

Class: EpinephrineSecretingCell

SubClassOf: Cell

That belongs_to_line some Somatic

and has_nucleation some mononucleate

and has_ploidy some diploid

and potentiality some TerminallyDifferentiated

and participates_in some EpinephrineBiosyntheticProcess

and participates_in some EpinephrineSecretion

Class: ProlactinSecretingCell

SubClassOf: Cell

That belongs_to_line some Somatic

and has_nucleation some mononucleate

and has_ploidy some diploid

and potentiality some TerminallyDifferentiated

and participates_in some PeptideHormoneSecretion

and participates_in some ProlactinSecretion

Page 40: Making Semantics do Some Work

Defined Cells

Class: SecretoryCell

EquivalentTo: Cell

that participates_in some (Secretion or

(part_of some Secretion))

Class: EndocrineCell

EquivalentTo: Cell

that participates_in some (EndocrineProcess or

(part_of some EndocrineProcess))
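How a cell comes to be recognised as a SecretoryCell can be sketched as follows: it participates in a process that is either a kind of secretion or part of a secretion process. The process hierarchy, the part_of link and the cell names below are simplified assumptions, not taken from GO.

```python
# Hypothetical process hierarchy (is-a), part-of links and cell descriptions.
IS_A = {
    "ProlactinSecretion": "PeptideHormoneSecretion",
    "PeptideHormoneSecretion": "Secretion",
    "EpinephrineSecretion": "Secretion",
}
PART_OF = {"VesicleFusion": "Secretion"}
PARTICIPATES_IN = {
    "ProlactinSecretingCell": {"ProlactinSecretion"},
    "VesicleFusingCell": {"VesicleFusion"},
    "RedBloodCell": {"OxygenTransport"},
}


def is_kind_of(process, target):
    """Walk up the is-a hierarchy from process looking for target."""
    while process is not None:
        if process == target:
            return True
        process = IS_A.get(process)
    return False


def is_secretory(cell):
    """participates_in some (Secretion or (part_of some Secretion))."""
    return any(
        is_kind_of(p, "Secretion") or is_kind_of(PART_OF.get(p), "Secretion")
        for p in PARTICIPATES_IN.get(cell, ())
    )


for cell in PARTICIPATES_IN:
    print(cell, is_secretory(cell))
```

No cell is asserted to be secretory; the grouping is derived from the processes each cell takes part in.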

Page 41: Making Semantics do Some Work

Asserted Hierarchy

Page 42: Making Semantics do Some Work

Inferred Hierarchy

Page 43: Making Semantics do Some Work

What We Found

• More subsumption relationships

• The “is-a” hierarchy is complete

• Explicitness made us ask questions

• Found bad structure

• Can just slip in a new cell

• Can make arbitrary queries based on any of the types of axis

Page 44: Making Semantics do Some Work

Conclusions

• Can use strict semantics and automated reasoning to build structurally sound ontologies

• Can catalogue instances and make discoveries

• If an object can be recognised by its features and features can be computationally generated, classification can be automated

• High cost and high benefit

Page 45: Making Semantics do Some Work

Acknowledgements

• Katy Wolstencroft did the protein phosphatase work as part of her Ph.D.

• The work on the cell type ontology was undertaken by members of the EPSRC OntoGenesis Network

• All the ontology work at Manchester relies on the support and input of the wider BioHealth and Information Management Groups