The real world of ontologies and phenotype representation: perspectives from the Neuroscience Information Framework

Maryann Martone, Ph.D., University of California, San Diego


Post on 26-Jun-2015


DESCRIPTION

Presentation to the RCN Phenotype Summit meeting

TRANSCRIPT

Page 1: The real world of ontologies and phenotype representation:  perspectives from the Neuroscience Information Framework

The real world of ontologies and phenotype representation: perspectives from the Neuroscience Information Framework

Maryann Martone, Ph.D., University of California, San Diego

Page 2

“Neural Choreography”

“A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits-- Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior....

However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”

Akil et al., Science, Feb 11, 2011

Page 3

“Data choreography”

In that same issue of Science:

• Asked peer reviewers from last year about the availability and use of data
• About half of those polled store their data only in their laboratories, which is not an ideal long-term solution.
• Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving.
• And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.

“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649)

Page 4

NIF is an initiative of the NIH Blueprint consortium of institutes

• What types of resources (data, tools, materials, services) are available to the neuroscience community?
• How many are there?
• What domains do they cover? What domains do they not cover?
• Where are they?
  • Web sites
  • Databases
  • Literature
  • Supplementary material
  • PDF files
  • Desk drawers
• Who uses them?
• Who creates them?
• How can we find them?
• How can we make them better in the future?

http://neuinfo.org

Page 5

In an ideal world...

We’d like to be able to find:

What is known:
• What is the average diameter of a Purkinje neuron?
• Is GRM1 expressed in cerebral cortex?
• What are the projections of hippocampus?
• What genes have been found to be upregulated in chronic drug abuse in adults?
• Is alpha synuclein in the striatum?
• What studies used my polyclonal antibody against GABA in humans?
• What rat strains have been used most extensively in research during the last 20 years?

What is not known:
• Connections among data
• Gaps in knowledge

Without some sort of framework, very difficult to do

Required Components:
– Query interface
– Search strategies
– Data sources
– Infrastructure
– Results display
– Why did I get this result?
– Analysis tools

Page 6

The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience

A portal for finding and using neuroscience resources

A consistent framework for describing resources

Provides simultaneous search of multiple types of information, organized by category

Supported by an expansive ontology for neuroscience

Utilizes advanced technologies to search the “hidden web”

http://neuinfo.org

UCSD, Yale, Cal Tech, George Mason, Washington Univ

Supported by NIH Blueprint

Literature

Database Federation

Registry

Page 7

We need more databases !?

•NIF Registry: A catalog of neuroscience-relevant resources

• > 5000 currently listed

• > 2000 databases

•And we are finding more every day

Page 8

NIF must work with ecosystem as it is today

• NIF was one of the first projects to attempt data integration in the neurosciences on a large scale
• NIF is supported by a contract that specified the number of resources to be added per year
  • Designed to be populated rapidly; set up process for progressive refinement
  • No budget was allocated to retrofit existing resources; had to work with them in their current state
  • We designed a system that required little to no cooperation or work from providers
• NIF was required to assemble (not create) ontologies very fast and to provide a platform through which the community could view, comment and add
• NIF is enriched by ontologies but does not depend on them
  • Took advantage of community ontologies
  • But needed to take a very pragmatic and aggressive approach to incorporating and using them
  • Neurolex semantic wiki

Page 9

What are the connections of the hippocampus?

Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”

• Query expansion: synonyms and related concepts
• Boolean queries
• Data sources categorized by “data type” and level of nervous system
• Common views across multiple sources
• Tutorials for using full resource when getting there from NIF
• Link back to record in original source
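The query expansion on this slide can be sketched roughly as follows. This is only an illustration, not NIF's actual implementation: the `SYNONYMS` table and the `expand_query` helper are hypothetical names, and the only synonyms used are the ones shown on the slide.

```python
# Hypothetical synonym table. In NIF the synonyms come from its ontologies;
# here only the hippocampus example from the slide is filled in.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms: dict = SYNONYMS) -> str:
    """Return a Boolean OR query made of the term plus any known synonyms."""
    variants = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(v) for v in variants)


print(expand_query("Hippocampus"))
```

A term with no entry in the table is passed through unchanged, so the expansion never narrows what the user asked for.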

Page 10

Imminent: NIF 5.0

NIF 5.0 about to be released

• New design
• New query features
• New analytics

Page 11

What do you mean by data?

Databases come in many shapes and sizes

• Primary data: data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data: data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD); gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data: claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries: metadata; pointers to data sets or materials stored elsewhere
• Data aggregators: aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source: data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information resources using a multitude of technologies

Page 12

Exploration: Where is alpha synuclein?

• Spatially:
  • Gene
  • Protein
  • Subcellular
  • Cellular
  • Regional
  • Organism
• Semantically:
  • Gene regulation networks
  • Protein pathways
  • Cellular local connectivity
  • Regional connectivity
  • Who is studying it?
  • Who is funding its study?

Networks exist across scales; all important in the nervous system

Page 13

NIFSTD Ontologies

• Set of modular ontologies
• 86,000+ distinct concepts + synonyms
• Bridge files between modules
• Expressed in OWL-DL language
• Currently supports OWL 2
• Tries to follow OBO community best practices
• Standardized to the same upper level ontologies, e.g., Basic Formal Ontology (BFO), OBO Relations Ontology (OBO-RO)
• Imports existing community ontologies, e.g., CHEBI, GO, PRO, DOID, OBI etc.
• Retains identifiers in most recent additions but reflects history

Covers major domains of neuroscience: Organisms, Brain Regions, Cells, Molecules, Subcellular parts, Diseases, Nervous system functions, Techniques

Fahim Imam, William Bug

Page 14

“Search computing”: Query by concept

What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!)

Morphine

Increased expression

Adult Mouse

Reasonable standards make it easy to search for and compare results

Page 15

Diseases of nervous system

New: Data analytics

NIF is in a unique position to answer questions about the neuroscience ecosystem using new analytics tools

[Figure; recovered labels: Neurodegenerative; Seizure disorders; Neoplastic disease of nervous system; NIH Reporter; NIF data federated sources]

Page 16

Results are organized within a common framework

Connects to

Synapsed with

Synapsed by

Input region

innervates

Axon innervates

Projects to

Cellular contact

Subcellular contact

Source site

Target site

Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases

Page 17

NIF Concept Mapper

Page 18

The scourge of neuroanatomical nomenclature: Importance of NIF semantic framework

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
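A rough sketch of how exact and synonym matches across databases might be counted. The database contents, the synonym table and the function names below are made up for illustration; the slide does not say how NIF computed its numbers.

```python
from collections import Counter


def normalize(term: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(term.lower().split())


def terms_shared(databases: dict) -> set:
    """Normalized terms that appear in more than one database."""
    counts = Counter()
    for terms in databases.values():
        counts.update({normalize(t) for t in terms})
    return {t for t, n in counts.items() if n > 1}


def synonym_matches(databases: dict, synonyms: dict) -> set:
    """Shared terms after mapping each term to a preferred label via a synonym table."""
    mapped = {
        name: {synonyms.get(normalize(t), normalize(t)) for t in terms}
        for name, terms in databases.items()
    }
    return terms_shared(mapped)


# Toy input: three hypothetical databases and one synonym pair.
DATABASES = {
    "db_a": {"Hippocampus", "CA1"},
    "db_b": {"hippocampus", "Ammon's horn"},
    "db_c": {"Cornu Ammonis"},
}
SYNONYMS = {"ammon's horn": "cornu ammonis"}

exact = terms_shared(DATABASES)
with_synonyms = synonym_matches(DATABASES, SYNONYMS)
print(sorted(exact))
print(sorted(with_synonyms - exact))  # matches found only through synonyms
```

A partonomy match would need one more step, comparing each term with the direct parts of the other term in an anatomy ontology, which is left out here.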

Page 19

Why so many names?

The brain is perhaps unique among major organ systems in the multiplicity of naming schemes for its major and minor regions.

The brain has been divided based on topology of major features, cyto- and myelo-architecture, developmental boundaries, supposed evolutionary origins, histochemistry, gene expression and functional criteria.

The gross anatomy of the brain reflects the underlying networks only superficially, and thus any parcellation reflects a somewhat arbitrary division based on one or more of these criteria.

“The ‘activation map’ images that commonly accompany brain imaging papers can be misleading to inexperienced readers, by seeming to suggest that the boundaries between ‘activated’ and ‘unactivated’ patches of cortex are unambiguous and sharp. Instead, as most researchers are aware, the apparent sharp boundaries are subject to the choice of threshold applied to the statistical tests that generate the image. What, then, justifies dividing the cortex into regions with boundaries based on this fuzzy, mutable measure of functional profile?” (Saxe et al., 2010, p. 39).

Brainmaps.org

Page 20

Program on Ontologies for Neural Structures

International Neuroinformatics Coordinating Facility

• Structural Lexicon Task Force
  • Defining brain structures
  • Translate among terminologies
• Neuronal Registry Task Force
  • Consistent naming scheme for neurons
  • Knowledge base of neuron properties
• Representation and Deployment Task Force
  • Formal representation
• Also interacts with Digital Atlasing Task Force

http://incf.org

Page 21

NeuroLex Wiki

http://neurolex.org

Stephen Larson

• Provide a simple framework for defining the concepts required
  • Light weight semantics
  • Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Building an extensive cross-disciplinary knowledge base for neuroscience

Demo D03

Page 22

Defining nervous system structures

Parcellation scheme: Set of parcels occupying part or all of an anatomical entity that has been delineated using a common approach or set of criteria, often in a single study. A parcellation scheme for any given individual entity may include gaps, transitional zones, or regions of uncertainty. A parcellation scheme derived from a set of individuals registered to a common target (atlas) may be probabilistic and include overlap of parcels in regions that reflect individual variability or imperfections in alignment.

• 14 parcellation schemes currently represented in Neurolex
• Documentation available

INCF task force on ontologies

Page 23

Basic model: do not conflate conceptual structures with parcels

Regional part of nervous system

Functional part of nervous system

Parcel

overlaps

overlaps overlaps

Parcel Parcel

Neuroscientists have many different parcellation schemes because they have many different ways of classifying brain structures, and the techniques for matching schemes to one another are imperfect

Page 24

Linking semantics to space: INCF Atlasing

www.neurolex.org

Link to spatial representation

in scalable brain atlas

Waxholm space

Seth Ruffins, Alan Ruttenberg, Rembrandt Bakker

Page 25

Neurons in Neurolex

International Neuroinformatics Coordinating Facility (INCF) building a knowledge base of neurons and their properties via the Neurolex Wiki

Led by Dr. Gordon Shepherd

Consistent and parseable naming scheme

Knowledge is readily accessible, editable and computable

While structure is imposed, don’t worry too much about the upper level classes of the ontology

Stephen Larson

Page 26


A KNOWLEDGE BASE OF NEURONAL PROPERTIES

Additional semantics added in NIFSTD by ontology engineer

Page 27

Concept-based search: search by meaning

Search Google: GABAergic neuron Search NIF: GABAergic neuron

NIF automatically searches for types of GABAergic neurons

Types of GABAergic neurons
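Concept-based search of this kind can be sketched as a walk down the subclass hierarchy. The table below is a hypothetical two-entry fragment with illustrative labels, and `expand_concept` is an assumed helper name; NIF's real expansion draws on the full ontology.

```python
from collections import deque

# Hypothetical is-a table (parent -> direct subclasses); labels are illustrative only.
SUBCLASSES = {
    "GABAergic neuron": ["Cerebellum Purkinje cell", "Neostriatum medium spiny neuron"],
    "Cerebellum Purkinje cell": [],
    "Neostriatum medium spiny neuron": [],
}


def expand_concept(term: str, subclasses: dict = SUBCLASSES) -> list:
    """Return the term followed by all of its descendants, breadth first."""
    expanded, queue, seen = [], deque([term]), set()
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # guard against terms reachable by more than one path
        seen.add(current)
        expanded.append(current)
        queue.extend(subclasses.get(current, []))
    return expanded


print(expand_concept("GABAergic neuron"))
```

A keyword search sees only the literal string, whereas the expanded list can be sent to every source as an OR query.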

Page 28

Challenges of multiscale neurodegenerative disease phenotypes

• Neurodegenerative diseases target very specific cell populations
• Model systems only replicate a subset of features of the disease
• Related phenotypes occur across anatomical scales
• Different vocabularies are used by different communities

not

not

Midbrain degenerated

Substantia nigra decreased in volume

Substantia nigra pars compacta atrophied

Loss of Snpc dopaminergic neurons

Degeneration of nigrostriatal terminals

Tyrosine-hydroxylase containing neurons

degenerate

Page 29

Approach: Use ontologies to provide necessary knowledge for matching related phenotypes

Sarah Maynard, Chris Mungall, Suzie Lewis, Fahim Imam

Midbrain

Substantia nigra

Substantia nigra pars compacta

Substantia nigra pars compacta dopamine cell

Dopamine

Neuron cell soma

Neuron (CL)

Part of neuron (GO)

Small molecule (Chebi)

Atrophied

Decreased volume

Fewer in number

Degenerate

Decreased in magnitude relative to

some normal

Has part

Has part

Is part of

Has part

Has part

Is a

Is a Is a

Is a

Entities

Qualities

NIFSTD/PKBOBO ontology

Page 30

Alzheimer’s disease

Human(birnlex_516)

Neocortex pyramidal

neuron

Increased number of

Lipofuscin

has part

inheres in

inheres in

towards

EQ Representation of Phenotypes in Neurodegenerative Disease: PATO and NIFSTD

Instance: Human with Alzheimer’s disease 050

Phenotype birnlex_2087_56

inheres in

about

Chris Mungall, Suzanna Lewis

Structured annotation model implemented in WIB
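One way to picture the EQ (entity-quality) annotation on this slide is as a small record. The sketch below is a simplification under stated assumptions: the class name, field names and the "has quality" relation label are hypothetical, and the labels are copied from the slide rather than from PATO or NIFSTD identifiers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EQPhenotype:
    """Entity-Quality description of one phenotype (simplified)."""

    entity: str                    # what is affected
    quality: str                   # how it is affected (PATO-style quality)
    towards: Optional[str] = None  # second entity for relational qualities

    def statements(self, phenotype_id: str) -> List[Tuple[str, str, str]]:
        """Flatten the phenotype into (subject, relation, object) statements."""
        result = [
            (phenotype_id, "has quality", self.quality),  # relation label is an assumption
            (phenotype_id, "inheres in", self.entity),
        ]
        if self.towards is not None:
            result.append((phenotype_id, "towards", self.towards))
        return result


# The example from the slide: more lipofuscin in neocortex pyramidal neurons.
lipofuscin = EQPhenotype(
    entity="Neocortex pyramidal neuron",
    quality="Increased number of",
    towards="Lipofuscin",
)
for statement in lipofuscin.statements("birnlex_2087_56"):
    print(statement)
```

Keeping entity, quality and the optional second entity as separate fields is what lets each part be matched against an ontology on its own.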

Page 31

OBD: Ontology based database

• Provides a user interface for matching organisms based on similarity of phenotypes
• Based on EQ model
• Uses knowledge in the ontology to compute similarity scores and other statistical measures like information content

http://www.berkeleybop.org/pkb/

Chris Mungall, Suzanna Lewis, Lawrence Berkeley Labs

Page 32

Thalamus

Cellular inclusion

Midline nuclear group

Lewy Body

Paracentral nucleus

Cellular inclusion

Computes common subsumers and information content among phenotypes
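A minimal sketch of common subsumers and information content, assuming the usual definition IC(t) = -log2(p(t)), where p(t) is the fraction of annotations at or below term t. The hierarchy fragment and the annotation counts are invented for illustration and are not taken from OBD or NIFSTD.

```python
import math

# Toy hierarchy (child -> parents); is-a and part-of are collapsed for simplicity.
PARENTS = {
    "substantia nigra pars compacta": ["substantia nigra"],
    "substantia nigra pars reticulata": ["substantia nigra"],
    "substantia nigra": ["midbrain"],
    "midbrain": ["brain"],
    "thalamus": ["brain"],
    "brain": [],
}

# Hypothetical annotation corpus: one entry per annotated phenotype.
ANNOTATIONS = [
    "substantia nigra pars compacta",
    "substantia nigra pars reticulata",
    "midbrain",
    "thalamus",
]


def ancestors(term: str) -> set:
    """The term itself plus everything above it in the hierarchy."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def common_subsumers(a: str, b: str) -> set:
    """Terms that subsume both a and b."""
    return ancestors(a) & ancestors(b)


def information_content(term: str, annotations: list) -> float:
    """IC = -log2 of the fraction of annotations at or below the term."""
    hits = sum(1 for t in annotations if term in ancestors(t))
    if hits == 0:
        return math.inf  # never annotated: maximally specific
    return -math.log2(hits / len(annotations))


def most_informative_common_subsumer(a: str, b: str, annotations: list) -> str:
    """The shared ancestor with the highest information content."""
    return max(common_subsumers(a, b), key=lambda t: information_content(t, annotations))


print(most_informative_common_subsumer(
    "substantia nigra pars compacta", "substantia nigra pars reticulata", ANNOTATIONS))
```

Rarely used, specific terms get a high score and broad terms such as the root get a score of zero, so two phenotypes that meet only at the root count as barely similar.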

Page 33

*B6CBA-TgN (HDexon1)62) that express exon1 of the human mutant HD gene- Li et al., J Neurosci, 21(21):8473-8481

PhenoSim: What organism is most similar to a human with Huntington’s disease?

• Putamen atrophied
• Globus pallidus neuropil degenerate
• Part of basal ganglia decreased in magnitude
• Fewer neostriatum medium spiny neurons in putamen
• Neurons in striatum degenerate
• Neuron in striatum decreased in magnitude
• Increased number of astrocytes in caudate nucleus
• Neurons in striatum degenerate
• Nervous system cell change in number in striatum

Page 34

Progressive enrichment

Understanding and comparing phenotypes will be enriched through community knowledge bases like Neurolex

Looking forward to continuing this as part of the Monarch project with Melissa Haendel, Chris Mungall and Suzie Lewis

Page 35

Top Down vs Bottom up

Top-down ontology construction
• A select few authors have write privileges
• Maximizes consistency of terms with each other (automated consistency checking)
• Making changes requires approval and re-publishing
• Works best when domain to be organized has: small corpus, formal categories, stable entities, restricted entities, clear edges.
• Works best with participants who are: expert catalogers, coordinated users, expert users, people with authoritative source of judgment

Bottom-up ontology construction
• Multiple participants can edit the ontology instantly (many eyes to correct errors)
• Semantics are limited to what is convenient for the domain
• Not a replacement for top-down construction; sometimes necessary to increase flexibility
• Necessary when domain has: large corpus, no formal categories, no clear edges
• Necessary when participants are: uncoordinated users, amateur users, naïve catalogers
• Neuroscience is a domain that is less formal and neuroscientists are more uncoordinated

NIFSTD

NEUROLEX

Important for Ontologists to define community contribution model

Page 36

It’s a messy ecosystem (and that’s OK)

NIF favors a hybrid, tiered, federated system

• Domain knowledge: Ontologies
• Claims about results: Virtuoso RDF triples
• Data: Data federation; Workflows
• Narrative: Full text access

Neuron

Brain part

Disease

Organism

Gene

Caudate projects to Snpc

Grm1 is upregulated in chronic cocaine

Betz cells degenerate in ALS

Page 37

Musings from the NIF

• No one can be stopped from doing what they need to do
• Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
  • If the market can support 11 MRI databases, fine
  • Some consolidation, coordination is warranted though
• Big, broad and messy beats small, narrow and neat
  • Without trying to integrate a lot of data, we will not know what needs to be done
  • A lot can be done with messy data; neatness helps though
  • Progressive refinement; addition of complexity through layers
• Be flexible and opportunistic
  • A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
• Think globally; act locally:
  • No source, not even NIF, is THE source; we are all a source

Page 38

Grabbing the long tail of small data

Analysis of NIF shows multiple databases with similar scope and content

Many contain partially overlapping data

Data “flows” from one resource to the next

Data is reinterpreted, reanalyzed or added to

Is duplication good or bad?

Page 39

Same data: different analysis

Chronic vs acute morphine in striatum

• Drug Related Gene database: extracted statements from figures, tables and supplementary data from published article
• Gemma: Reanalyzed microarray results from GEO using different algorithms
• Both provide results of increased or decreased expression as a function of experimental paradigm
  • 4 strains of mice
  • 3 conditions: chronic morphine, acute morphine, saline

Mined NIF for all references to GEO ID’s: found small number where the same dataset was represented in two or more databases

http://www.chibi.ubc.ca/Gemma/home.html

Page 40

How easy was it to compare?

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma: Increased expression/decreased expression
• DRG: Increased expression/decreased expression
• But... Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases

Analysis:
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment

NIF annotation standard
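The baseline mismatch described on this slide can be handled by re-expressing every statement against one reference condition before comparing. The sketch below assumes a simple two-condition comparison (chronic morphine vs saline); the gene names, record layout and function names are hypothetical and not taken from Gemma or DRG.

```python
FLIP = {"increased": "decreased", "decreased": "increased"}


def normalize_direction(direction: str, baseline: str, reference: str = "saline") -> str:
    """Express a change relative to the reference baseline.

    If the statement was reported against the other condition, the
    direction of change is flipped (valid only for a two-condition contrast).
    """
    return direction if baseline == reference else FLIP[direction]


def compare(gemma: dict, drg: dict) -> dict:
    """Label each Gemma gene as 'consistent', 'opposite' or 'missing' in DRG."""
    result = {}
    for gene, (direction, baseline) in gemma.items():
        if gene not in drg:
            result[gene] = "missing"
            continue
        drg_direction, drg_baseline = drg[gene]
        same = (normalize_direction(direction, baseline)
                == normalize_direction(drg_direction, drg_baseline))
        result[gene] = "consistent" if same else "opposite"
    return result


# Made-up records: gene -> (direction of change, baseline it was measured against).
GEMMA = {
    "GeneA": ("decreased", "chronic morphine"),
    "GeneB": ("increased", "chronic morphine"),
    "GeneC": ("increased", "chronic morphine"),
}
DRG = {
    "GeneA": ("increased", "saline"),
    "GeneB": ("increased", "saline"),
}

print(compare(GEMMA, DRG))
```

Without the normalization step, GeneA would look contradictory even though both sources report the same underlying change.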

Page 41

Beware of False Dichotomies

Top-down vs bottom up

Light weight vs heavy weight

“Chaotic Nihilists and Semantic Idealists”

Text mining vs annotation

Curators vs scientists

Human vs machine

DOI’s vs URI’s

http://www.datanami.com/datanami/2013-02-05/chaotic_nihilists_and_semantic_idealists.html

Page 42

NIF team (past and present)

Jeff Grethe, UCSD, Co Investigator, Interim PI; Amarnath Gupta, UCSD, Co Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, Cal Tech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum

Fahim Imam, NIF Ontology Engineer; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Lee Hornbrook; Binh Ngo; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer