Knowledge-Based Integration of Neuroscience Data Sources Amarnath Gupta Bertram Ludäscher Maryann Martone University of California San Diego

Knowledge-Based Integration of Neuroscience Data Sources

Amarnath Gupta

Bertram Ludäscher

Maryann Martone

University of California San Diego

A Standard Information Mediation Framework

[Figure: a client query is posed against an integrated XML view. A mediator, driven by a view definition, builds that view from the XML views exported by wrappers over the underlying data sources; one source is a native XML data source.]
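
The layering in this framework can be sketched in a few lines of Python. This is only an illustrative sketch: the class names `Wrapper` and `Mediator`, the union-style view definition, and the toy records are assumptions made for the example and do not come from the slides.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List


class Wrapper:
    """Exports one source as an XML view (hypothetical interface)."""

    def __init__(self, name: str, fetch: Callable[[], List[dict]]):
        self.name = name
        self._fetch = fetch

    def xml_view(self) -> ET.Element:
        # Convert native records into a common semistructured (XML) form.
        root = ET.Element("source", {"name": self.name})
        for record in self._fetch():
            item = ET.SubElement(root, "record")
            for key, value in record.items():
                ET.SubElement(item, key).text = str(value)
        return root


class Mediator:
    """Answers client queries against a view defined over the wrappers."""

    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = wrappers

    def integrated_view(self) -> ET.Element:
        # View definition (assumed here): the union of all exported views.
        view = ET.Element("integrated_view")
        for wrapper in self.wrappers:
            view.append(wrapper.xml_view())
        return view

    def query(self, tag: str, value: str) -> List[Dict[str, str]]:
        # Client query: all records whose child element <tag> equals value.
        hits = []
        for record in self.integrated_view().iter("record"):
            if record.findtext(tag) == value:
                hits.append({child.tag: child.text for child in record})
        return hits


# Toy sources, invented for illustration only.
relational = Wrapper("db", lambda: [{"protein": "RyR", "amount": 30}])
web = Wrapper("www", lambda: [{"protein": "RyR", "organism": "rat"}])
mediator = Mediator([relational, web])
results = mediator.query("protein", "RyR")
```

The client only ever sees the integrated view; each wrapper hides the native format of its source.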

A Neuroscience Question

Cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?

[Figure: the question is posed against an integrated view. A mediator with a view definition sits above wrappers over the protein localization, morphometry, neurotransmission, and WWW (CaBP, Expasy) sources.]

Integration Issues

• Structural Heterogeneity
– Resolved by converting to common semistructured data model

• Heterogeneity in Query Capabilities
– Resolved by writing wrappers with binding patterns and other capability-definition languages

• Semantic Heterogeneity
– Schema conflicts
    • Partially resolved by mapping rules in the mediator
– Hidden Semantics?
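
As a rough illustration of the binding-pattern idea in the second bullet, the sketch below models a source that only answers queries in which certain attributes are bound. The class `BindingPatternWrapper`, the "b"/"f" notation, and the toy rows are assumptions for this example, not the capability-definition language used by the authors.

```python
from typing import Dict, List


class CapabilityError(Exception):
    """Raised when a query does not match the source's binding pattern."""


class BindingPatternWrapper:
    def __init__(self, rows: List[Dict[str, str]], pattern: Dict[str, str]):
        # pattern maps each attribute to "b" (must be bound) or "f" (free).
        self.rows = rows
        self.pattern = pattern

    def query(self, **bound: str) -> List[Dict[str, str]]:
        required = {a for a, mode in self.pattern.items() if mode == "b"}
        missing = required - set(bound)
        if missing:
            raise CapabilityError(f"unbound attributes: {sorted(missing)}")
        return [row for row in self.rows
                if all(row.get(a) == v for a, v in bound.items())]


# Toy source resembling a web form that needs a sequence as input.
# Sequence and protein names are placeholders, not real data.
source = BindingPatternWrapper(
    rows=[
        {"sequence": "SEQ1", "homologue": "protein_A"},
        {"sequence": "SEQ1", "homologue": "protein_B"},
        {"sequence": "SEQ2", "homologue": "protein_C"},
    ],
    pattern={"sequence": "b", "homologue": "f"},
)
answers = source.query(sequence="SEQ1")
```

A mediator that knows the pattern can plan around it, for example by first obtaining the sequence from another source before calling this one.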

Hidden Semantics: Protein Localization

<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</>
    ….
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</>
        <amount name="RyR">0</>
      </>
      <structure fraction="0.2">
        <name>branchlet</>
        <amount name="RyR">30</>
      </>

[Figure labels: Molecular layer of Cerebellar Cortex; Purkinje Cell layer of Cerebellar Cortex; Fragment of dendrite]

Hidden Semantics: Morphometry

<neuron name="purkinje cell">
  <branch level="10">
    <shaft>
      …
    </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</>
      <min_section>1.93</>
      <max_section>4.47</>
      <surface_area>9.884</>
      <volume>7.930</>
      <head>
        <width>4.47</>
        <length>1.79</>
      </head>
    </spine>
    …

Annotations:
• Branch level beyond 4 is a branchlet
• Must be dendritic because Purkinje cells don’t have somatic spines
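
The two pieces of unstated domain knowledge on this slide can be written down as explicit rules. The following is a minimal sketch under assumptions: the `Spine` record, the function name `anatomical_context`, and the output keys are hypothetical, and "beyond 4" is read as a branch level strictly greater than 4.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Spine:
    neuron: str        # e.g. "purkinje cell"
    branch_level: int  # level of the branch carrying the spine
    number: int


def anatomical_context(spine: Spine) -> Dict[str, str]:
    """Make the hidden semantics of a morphometry record explicit."""
    context = {"structure": "spine"}
    # Rule 1: a branch at a level beyond 4 is a branchlet.
    if spine.branch_level > 4:
        context["on"] = "branchlet"
    # Rule 2: Purkinje cells have no somatic spines, so the spine
    # must sit on a dendrite.
    if spine.neuron == "purkinje cell":
        context["compartment"] = "dendrite"
    return context


context = anatomical_context(Spine("purkinje cell", branch_level=10, number=1))
```

Once the context is explicit, the terms "spine" and "branchlet" in the protein localization source have something to join against in the morphometry source.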

The Problem

• Multiple Worlds Integration
– compatible terms not directly joinable
– complex, indirect associations among schema elements
– unstated integrity constraints

• Why not use ontologies?
– typical ontologies associate terms along limited number of dimensions

• What’s needed
– a “theory” under which non-identical terms can be “semantically” joined

Our Approach

• Modify the standard Mediation Architecture
– Wrapper
    • Extend to encode an object-version of the structure schema
– Mediator
    • Redesign to incorporate auxiliary knowledge sources to
        – Correlate object schema of sources
        – Define additional objects not specified but derivable from sources

• At the Mediator
– Use a logic engine to
    • Encode the mapping rules between sources
    • Define integrated views using a combination of exported objects from source and the auxiliary knowledge sources
    • Perform query decomposition

• We still use Global-as-View form of mediation

The KIND Architecture

[Figure: the mediator combines a logic engine, integration logic, view definition rules, the schema of registered sources, and materialized views, and exposes an integrated user view. It draws on Auxiliary Knowledge Source 1 and Auxiliary Knowledge Source 2. Each data source (Src 1, Src 2) is wrapped by a structure wrapper and an object wrapper.]

The Knowledge-Base

• Situate every data object in its anatomical context
– An illustration
– New data is registered with the knowledge-base
– Insertion of new data reconciles the current knowledge-base with the new information by:
    • Indexing the data with the source as part of registration
    • Extending the knowledge-base
    • Creating new views with complex rules to encode additional domain knowledge

F-Logic for the Mediation Engine

• Why F-Logic?
– Provides the power of Datalog (with negation) and object creation through Skolem IDs
– Correct amount of “notational sugar” and rules to provide object-oriented abstraction
– Schema-level reasoning
– Expressing variable arity

• F-Logic in KIND
– Source schema wrapped into F-Logic schema
– Knowledge-sources programmed in F-Logic
– Definition of Integrated Views

Wrapping into Logic Objects

• Automated Part

<!ELEMENT Studies (Study)*>
<!ELEMENT Study (study_id, … animal, experiments, experimenters)>
<!ELEMENT experiments (experiment)*>
<!ELEMENT experiment (description, instrument, parameters)>

studyDB[studies =>> study].
study[study_id => string; … animal => animal; experiments =>> experiment; experimenters =>> string].
…

• Non-automated Part
    • Subclasses
        mushroom_spine::spine
    • Rules
        S:mushroom_spine IF S:spine[head -> _; neck -> _].
    • Integrity Constraints
        ic1(S):alert[type -> "invalid spine"; object -> S] IF S:spine[undef ->> {head, neck}].
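
For readers unfamiliar with F-Logic, the subclass rule and the integrity constraint can be mimicked in Python. This is a sketch, not the KIND implementation: the dictionary encoding of spines and the function names are hypothetical, and the constraint is read here as "raise an alert when both head and neck are undefined".

```python
from typing import Dict, List, Optional

Spine = Dict[str, Optional[float]]


def is_mushroom_spine(spine: Spine) -> bool:
    # Rule from the slide: a spine with both a head and a neck
    # is classified as a mushroom_spine.
    return spine.get("head") is not None and spine.get("neck") is not None


def integrity_alerts(spines: Dict[str, Spine]) -> List[dict]:
    # ic1, as read here: alert when both head and neck are undefined.
    alerts = []
    for spine_id, spine in spines.items():
        if spine.get("head") is None and spine.get("neck") is None:
            alerts.append({"type": "invalid spine", "object": spine_id})
    return alerts


# Toy objects; identifiers and values are illustrative only.
spines = {
    "s1": {"head": 4.47, "neck": 1.2},
    "s2": {"head": None, "neck": None},
    "s3": {"head": 3.1, "neck": None},
}
mushrooms = [sid for sid, s in spines.items() if is_mushroom_spine(s)]
alerts = integrity_alerts(spines)
```

In the logic engine both statements are declarative rules, so classification and constraint checking happen as part of query evaluation rather than as separate passes.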

Computing with Auxiliary Sources

• Creating Mediated Classes

Union view:
animal[M => R] IF S:source, S.animal[M => R].
animal[taxon => 'TAXON'.taxon].

Association rule:
X[taxon -> T] IF X:'PROLAB'.animal[name -> N], words(N,[W1,W2|_]), T:'TAXON'.taxon[genus -> W1; species -> W2].

• Reasoning with Schema

Schema:
taxon[subspecies => string; species => string; genus => string; … phylum => string; kingdom => string; superkingdom => string].

At Mediator:
subspecies::species::genus:: … kingdom::superkingdom

Class creation by schema reasoning:
T:TR, TR::TR1 IF T:'TAXON'.taxon[Taxon_Rank -> TR, Taxon_Rank1 -> TR1], Taxon_Rank::Taxon_Rank1.
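
The association rule links an animal in PROLAB to a taxon by taking the first two words of the animal's name as genus and species. A hedged Python sketch of that matching step follows; the record layout, identifiers, and the function `link_taxon` are assumptions for illustration only.

```python
from typing import Dict, List, Optional

# Toy stand-ins for the two sources; records are illustrative only.
PROLAB_ANIMALS = [
    {"id": "a1", "name": "Rattus norvegicus albino"},
    {"id": "a2", "name": "Mus musculus"},
    {"id": "a3", "name": "unknown"},
]
TAXA = [
    {"id": "t1", "genus": "Rattus", "species": "norvegicus"},
    {"id": "t2", "genus": "Mus", "species": "musculus"},
]


def link_taxon(animal: Dict[str, str],
               taxa: List[Dict[str, str]]) -> Optional[str]:
    """Return the id of the taxon whose genus and species match the
    first two words of the animal's name, or None if there is none."""
    words = animal["name"].split()
    if len(words) < 2:
        return None
    genus, species = words[0], words[1]
    for taxon in taxa:
        if taxon["genus"] == genus and taxon["species"] == species:
            return taxon["id"]
    return None


links = {a["id"]: link_taxon(a, TAXA) for a in PROLAB_ANIMALS}
```

The match is purely lexical, so it inherits any naming inconsistencies in the source; an animal whose name does not start with a genus and species simply stays unlinked.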

Integrated View Definition

• Views are defined between sources and knowledge base
• Example: protein_distribution
– given: organism, protein, brain_region
– KB Anatom:
    • recursively traverse the has_a paths under brain_region, collect all anatomical_entities
– Source PROLAB:
    • join with anatomical structures and collect the value of attribute “image.segments.features.feature.protein_amount” where “image.segments.features.feature.protein_name” = protein and “study_db.study.animal.name” = organism
– Mediator:
    • aggregate over all parents up to brain_region
    • report distribution
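
The three steps of protein_distribution (traverse has_a, join with the source, aggregate upward) can be sketched as follows. The `HAS_A` hierarchy and the measurement tuples are invented toy data, not the actual Anatom knowledge base or PROLAB contents, and summation is only one plausible choice of aggregate.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy has_a hierarchy (illustrative, not a real anatomical ontology).
HAS_A: Dict[str, List[str]] = {
    "cerebellum": ["cerebellar cortex"],
    "cerebellar cortex": ["molecular layer", "purkinje cell layer"],
    "molecular layer": ["branchlet", "spine"],
}

# Toy source rows: (organism, protein, structure, amount).
MEASUREMENTS: List[Tuple[str, str, str, float]] = [
    ("rat", "RyR", "branchlet", 30.0),
    ("rat", "RyR", "spine", 0.0),
    ("rat", "RyR", "purkinje cell layer", 12.0),
    ("mouse", "RyR", "branchlet", 7.0),
]


def paths_below(region: str, has_a: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Depth-first traversal of has_a: map every entity under region
    (including region itself) to its path from region."""
    paths = {region: [region]}
    stack = [region]
    while stack:
        parent = stack.pop()
        for child in has_a.get(parent, []):
            if child not in paths:
                paths[child] = paths[parent] + [child]
                stack.append(child)
    return paths


def protein_distribution(organism: str, protein: str, region: str) -> Dict[str, float]:
    paths = paths_below(region, HAS_A)
    totals: Dict[str, float] = defaultdict(float)
    for org, prot, structure, amount in MEASUREMENTS:
        # Join: keep rows for this organism and protein that fall under region.
        if org == organism and prot == protein and structure in paths:
            # Aggregate: credit the amount to the structure and all its parents.
            for entity in paths[structure]:
                totals[entity] += amount
    return dict(totals)


distribution = protein_distribution("rat", "RyR", "cerebellum")
```

Each entry in the result is the total amount at or below that anatomical entity, so coarse and fine levels of the hierarchy can be reported side by side.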

Query Evaluation Example (a second integrated view)

• protein distribution of Human NCS-1 homologue
– from wrapped CaBP website:
    • get the amino acid sequence for human NCS-1
– from wrapped Expasy website:
    • submit amino acid sequence, get ranked homologues
– at Mediator:
    • select homologues H found in rat, and homology > 0.70
– at Mediator:
    • for each h in H
        – from previous view:
            » protein_distribution(rat, h, cerebellum, distribution)
    • Construct result
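
The evaluation plan above amounts to a short pipeline. In the sketch below the three source calls are passed in as functions so that stubs can stand in for the wrapped CaBP site, the wrapped Expasy site, and the protein_distribution view; all function names, the `Homologue` record, and the stub data are hypothetical placeholders rather than real sequences or results.

```python
from typing import Callable, Dict, List, NamedTuple


class Homologue(NamedTuple):
    protein: str
    organism: str
    homology: float  # fraction between 0 and 1


def evaluate(
    get_sequence: Callable[[str], str],
    rank_homologues: Callable[[str], List[Homologue]],
    protein_distribution: Callable[[str, str, str], Dict[str, float]],
    organism: str = "rat",
    region: str = "cerebellum",
    threshold: float = 0.70,
) -> Dict[str, Dict[str, float]]:
    # Step 1: fetch the amino acid sequence (wrapped CaBP site).
    sequence = get_sequence("human NCS-1")
    # Step 2: submit the sequence, get ranked homologues (wrapped Expasy site).
    ranked = rank_homologues(sequence)
    # Step 3, at the mediator: keep homologues in the organism above threshold.
    selected = [h for h in ranked
                if h.organism == organism and h.homology > threshold]
    # Step 4, at the mediator: call the previous view once per homologue.
    return {h.protein: protein_distribution(organism, h.protein, region)
            for h in selected}


# Stubs with placeholder data, for illustration only.
def stub_sequence(name: str) -> str:
    return "SEQ-PLACEHOLDER"


def stub_rank(sequence: str) -> List[Homologue]:
    return [
        Homologue("protein_A", "rat", 0.95),
        Homologue("protein_B", "rat", 0.60),
        Homologue("protein_C", "mouse", 0.90),
    ]


def stub_distribution(organism: str, protein: str, region: str) -> Dict[str, float]:
    return {region: 1.0}


result = evaluate(stub_sequence, stub_rank, stub_distribution)
```

Changing `organism` answers the follow-up question about other rodents without touching the plan itself.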

Implementation

• System
– Flora as F-Logic Engine
– Communicate with ODBC databases through underlying XSB Prolog
– XML wrapping and Web querying through XMAS, our XML query language and custom-built wrappers

• Data
– Human Brain Project sites
– NPACI Neuroscience Thrust sites

Work in Progress

• Architecture
– plug-in architecture for
    • domain knowledge sources
    • conceptual models from data sources

• Functionality
– better handling of large data
– operations
    • expressive query language
    • operators for domain knowledge manipulation
– query evaluation
    • query optimization using domain knowledge

• Demonstration
– at VLDB 2000