rapid development of an ontology of coriell cell lines chao pang, tomasz adamusiak, helen parkinson...
Post on 31-Dec-2015
214 Views
Preview:
TRANSCRIPT
Rapid Development of an Ontology of Coriell Cell Lines Chao Pang, Tomasz Adamusiak, Helen Parkinson and James Malone
malone@ebi.ac.uk
Master headline
Overview
• Motivation – our EBI use cases, big stuff in little time
• Methods – normalization, axiomatisation, automation
• Results – large flat ontology, adding structure
• Desiderata
• Future work
Master headline
Motivation
• Lot of large (often online) catalogues
• Infeasible to manually ontologise all of them (and cost is not easy to justify)
Master headline
High quality artefacts take time
33k classes 75k classes
FMA Gene Ontology
3k classes
OBI
30 person years
Since 1999
Since 2006
$money$
Funders
Master headline
Motivation
• Would like to include these in BioSample Database for curating data, querying, browsing
• Ontology use has proven to add value (e.g. GXA)
• RDF representations of data for integration
Master headline
Methods: A Model for Cell Lines
• Agreed with cell line ontology folk
• With addition of is_model_for, to reflect use of cell lines as models for particular diseases (they can be used as models for diseases other than the original disease borne by the subject)
Master headline
Methods: Programmatic Engineering
• Lexical concept recognition (2) - Map the terms to existing ontologies by either class label or synonyms in order to get identifiers (e.g. label – Homo sapiens and synonym – human)
• Produce a list of classes from reference ontologies (3) e.g.• Map organism terms: NCBI taxonomy ontology• Map disease terms: Disease ontology
• Map these to an upper ontology and axiomatic structure (4) cell line model on previous slide
Methods: Our “raw” data• Coriell database dump • Extract information for cell line, cell type, anatomy,
organism, sex and disease. (27002 cell lines)
Lexical Concept Recognition
• Our mapper tool was used, adapted the Levenshtein distance algorithm <http://www.ebi.ac.uk/efo/tools>
• Mapping result:• 93 organism - NCBI taxonomy ontology• 61 anatomy – MAT (species neutral), FMA and other anatomy• 23 cell type – cell ontology• 337 disease - Disease ontology (DO) (via OMIM)
• Check the mapping file manually• Inconsistency e.g. fibroma as anatomy• Unmapped terms• Unclear information e.g. buttock-thigh
Master headline
Using OWL-API to build ontology automatically
buildingOntology.java
Mapping files
Coriell cell line protocol
Reference ontologies Step one:Adding classes
Step two:Adding restriction
Coriell cell line Ontology
Spreadsheet.txt
Example: import class cervix from EFOIt has parent class organism part
It has class relations part_of some female reproductive system
Hela cell line
Master headline
Adding Structure to Normalized Hierarchy
Coriell cell line ontology
Query: human cancer cell line
Master headline
Explanation of query: human cancer cell line
Cancer
humanany cell type any organism part
Cell line
Master headline
Results In Brief
• Large (~28,000) cell line ontology, lots of axiomatisation for the axes mentioned (disease, organism, cell type, anatomy some extra bits like gender)
• Running code took ~5 minutes
• Flat asserted hierarchy under cell line; structure is all inferred using axioms as per Rector Normalization
• Makes more flexible to change and also querying more powerful
Master headline
BioPortal Analysis by Matt Horridge et al• http://bio-ontologies.knowledgeblog.org/135
• “the Coriell Cell Line Ontology, which contains over 45,000 non-trivial entailments, and an average of 4 justifications per entailment, peaking out at 65 justifications for one particular entailment“
• Coriell is quite rich axiomatically – this is what we would expect given the normalization method
• Suggestion rich ontologies can be created rapdily
Master headline
Discussion
• Good bits:• Adding new classes was simple once code was done• Re-running the code also simple if a part of pattern was changed• Can be reused for different ontology with similar sort of input• Axiomatisation gives richer querying and renders implicit knowledge
explicit
• Harder bits:• Some data clean up necessary before using the code (SQL dumps can
be messy)• Eliminating errors from concept matching (though expansion on
synonyms helped)• Textual definitions for all classes • Size of the ontology created a few issues, e.g. BioPortal not able to
browse, most reasoners can not complete on the ontology
Conclusions
• To build very big ontologies we need either• Lots of resource (i.e. money)• Lots of people (crowd-sourcing)• Good tools and methods
• Each having pros and cons• Lots of money requires, well, lots of money• Lots of people requires community buy in and ‘trust’ (so
audit trail crucial) – we’re not always good at that • Automation can introduce errors and require some curation
Master headline
Master headline
Desiderata: Querying and External URIs • Easier to use interfaces, users like it simple
• One of top user feedbacks is “we’d like it to work like Google does”
• Users also like putting URIs into a webpage and getting something sensible back i.e.:
Master headline
Future Work• Creating text definitions
• Will follow up on work by Stevens R, Malone, J et al (2011)* for creating natural language definitions from OWL
• Text mining to extract missing disease information from textual descriptions from original catalogue
• Begin curating data for BioSample Database at EBI using ontology
• Integrate Coriell ontology with rest of process from sample to analysis, e.g. software ontology
http://theswo.sourceforge.net
*Robert Stevens, James Malone, Sandra Williams, Richard Power, Alan Third (2011) Automating generation of textual class definitions from OWL to English.
Journal of Biomedical Semantics, 2011 May 17, Vol. 2 Suppl 2:S5.
Master headline
Acknowledgements
• Thank to Chao Pang, Helen Parkinson,Tomasz Adamusiak, Paulo Silva and all people from functional genomics production team.
• Gen2phen• The Coriell Institute for Medical Research• Alan Ruttenberg and Science Commons for providing the
Coriell SQL dump• Lynn Schriml and colleagues from the Disease Ontology
for OMIM mappings • Sirarat Sarntivijai, Oliver He, Alexander Diehl and Terry
Meehan for discussions on the cell line model.
top related