1 tutorial #5: scientific data integration and mediation san diego supercomputer center u.c. san...

95
1 Tutorial #5: Scientific Data Integration and Mediation San Diego Supercomputer Center San Diego Supercomputer Center U.C. San Diego U.C. San Diego Bertram Lud Bertram Lud ä ä scher scher Ilkay Altintas Ilkay Altintas Amarnath Gupta Amarnath Gupta Kai Lin Kai Lin

Upload: nathaniel-sims

Post on 28-Dec-2015

219 views

Category:

Documents


3 download

TRANSCRIPT

1

Tutorial #5:Scientific Data Integration and Mediation

San Diego Supercomputer CenterSan Diego Supercomputer Center

U.C. San DiegoU.C. San Diego

Bertram LudBertram Ludääscherscher

Ilkay AltintasIlkay Altintas

Amarnath GuptaAmarnath Gupta

Kai LinKai Lin

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 2

Acknowledgements• National Science Foundation (NSF)– www.nsf.gov

• GEOsciences Network (NSF) – www.geongrid.org

• Biomedical Informatics Research Network (NIH)– www.nbirn.net

• Science Environment for Ecological Knowledge (NSF)– seek.ecoinformatics.org

• Scientific Data Management Center (DOE)– sdm.lbl.gov/sdmcenter/

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 3

Outline• 8:30 – 10:30am: Tutorial: Data Integration & Mediation

– Introduction to database mediation: • motivation and architecture• XML-based data integration

– Database mediation theory primer: • logic view definitions, view unfolding, computing feasible plans

– From XML-based to Knowledge-based mediation:• use of ontologies in data integration, ...

• 10:30 – 10:45am: BREAK

• 10:45 – 12:00: Applications and Demos– 10:45 – 11:05 Mediator Demo – 11:05 – 11:20 Queries w/ Ontology Support – 11:20 – 11:40 Scientific Workflows – 11:40 – 12:00 KNOW-ME Ontology Tool

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 4

Information Integration Challenges • System aspects: “Grid” Middleware

– distributed data & computing– Web Services, WSDL/SOAP, …– sources = functions, files, databases, …

• Syntax & Structure: XML-Based Mediators

– wrapping, restructuring – XML queries and views– sources = XML databases

• Semantics: Model-Based/Semantic Mediators

– conceptual models and declarative views – SemanticWeb/KnowledgeGrid stuff:

ontologies, description logics (RDF(S), DAML+OIL, OWL ...)

– sources = knowledge bases (DB+CMs+ICs)

SyntaxSyntax

StructureStructure

SemanticsSemantics

System aspectsSystem aspects

reconciling reconciling SS44 heterogeneitiesheterogeneities

““gluing” together gluing” together multiple data sources multiple data sources

bridging information bridging information and knowledge gaps and knowledge gaps computationallycomputationally

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 5

Information Integration from a DB Perspective

• Information Integration Problem– Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user

questions Q1,..., Qn that can be answered using the Si

– Find: the answers to Q1, ..., Qn

• The Database Perspective: source = “database” Si has a schema (relational, XML, OO, ...)

Si can be queried

define virtual (or materialized) integrated views V over S1 ,..., Sk using database query languages (SQL, XQuery,...)

questions become queries Qi against V(S1,..., Sk)

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 6

Standard (XML-Based) Mediator Architecture

MEDIATORMEDIATOR

XML Queries & Results

S1

Wrapper

(XML) View

S2

Wrapper

(XML) View

Sk

Wrapper

(XML) View

Integrated Global(XML) View G

Integrated ViewDefinition

G(..) S1(..)…Sk(..)

USER/ClientUSER/Client

Query Q ( G (SQuery Q ( G (S11,..., S,..., Skk) )) )

wrappers implementedas web services

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 7

• Data Integration Approaches:– Let’s just share data, e.g., link everything from a web page!– ... or better put everything into an relational or XML database– ... and do remote access using the Grid– ... or just use Web services!

• Nice try. But: – “Find the files where the amygdala was segmented.”– “Which other structures were segmented in the same files?”– “Did the volume of any of those structures differ much from

normal?”– “What is the cerebellar distribution of rat proteins with more

than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”

Some BIRNing Data Integration Questions

Biomedical InformaticsBiomedical InformaticsResearch NetworkResearch Networkhttp://nbirn.nethttp://nbirn.net

An Online Shopper’s Information Integration Problem

El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logicus-Philosophicus within a week?”

?Information Integration

?Information Integration

addall.comaddall.com

“One-World”Mediation

“One-World”Mediation

amazon.comamazon.com A1books.comA1books.comhalf.comhalf.combarnes&noble.combarnes&noble.com

WWWpublic library

A Home Buyer’s Information Integration Problem

What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood

with below-average crime rate and diverse population?

?Information Integration

?Information Integration

RealtorRealtor DemographicsDemographicsSchool RankingsSchool RankingsCrime StatsCrime Stats

“Multiple-Worlds”Mediation

“Multiple-Worlds”Mediation

A Geoscientist’s Information Integration ProblemWhat is the distribution and U/ Pb zircon ages of A-type plutons in VA?

How about their 3-D geometry ? How does it relate to host rock structures?

?Information Integration

?Information Integration

Geologic Map(Virginia)

Geologic Map(Virginia) GeoChemicalGeoChemical GeoPhysical

(gravity contours)

GeoPhysical(gravity contours)

GeoChronologic(Concordia)

GeoChronologic(Concordia)

Foliation Map(structure DB)

Foliation Map(structure DB)

“Complex Multiple-Worlds”

Mediation

“Complex Multiple-Worlds”

Mediation

A Neuroscientist’s Information Integration ProblemWhat is the cerebellar distribution of rat proteins with more than 70%

homology with human NCS-1? Any structure specificity?How about other rodents?

?Information Integration

?Information Integration

protein localization(NCMIR)

protein localization(NCMIR)

neurotransmission(SENSELAB)

neurotransmission(SENSELAB)

sequence info(CaPROT)

sequence info(CaPROT) morphometry

(SYNAPSE)

morphometry(SYNAPSE)

“Complex Multiple-Worlds”

Mediation

“Complex Multiple-Worlds”

Mediation

Biomedical InformaticsBiomedical InformaticsResearch NetworkResearch Networkhttp://nbirn.nethttp://nbirn.net

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 12

Structural / XML-Based Mediation

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 13

Abstract XML-Based Mediator Architecture

S_1

MEDIATORMEDIATOR

XML Queries & Results

USER/ClientUSER/Client

Wrapper

XML View

S_2

Wrapper

XML View

S_k

Wrapper

XML View

IntegratedXML View V

Integrated ViewDefinition

IVD(S1,...,Sn)

Query Q o V (S_1,...,S_k)Query Q o V (S_1,...,S_k)

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 14

Extensible Markup Language (XML)

• (meta)language for marking up text & data with user-definable tags– (X)HTML, XSLT, XML Schema, ...– MathML, BioML, GeoML, NeuroML, ... – XML-RPC, SOAP, ...

• semistructured tree data model– flexible: marked-up text, web-pages,

databases, ...

• container model: – “boxes within boxes”

• (meta)language for marking up text & data with user-definable tags– (X)HTML, XSLT, XML Schema, ...– MathML, BioML, GeoML, NeuroML, ... – XML-RPC, SOAP, ...

• semistructured tree data model– flexible: marked-up text, web-pages,

databases, ...

• container model: – “boxes within boxes”

... in their wonderful book called SemWeb Tractat by B. Schatz and T.B. Lee, the authors show how ...

author:“B. Schatz”

book:

title:“SemWeb Tractat”

author:“T.B. Lee”

book

title author

“SemWeb Tractat”

author

“B. Schatz”“T.B. Lee”

<book> <title>SemWeb Tractat</title> <author>B. Schatz</author> <author>T.B. Lee</author></book>

... in their wonderful book called <title>SemWeb Tractat </title> by B. Schatz and T.B. Lee, the authors show how ...

... in their wonderful book called <title>SemWeb Tractat</title> by <author>B. Schatz</author> and <author> T.B. Lee</author>, the authors show how ...

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 15

Example: Relational Data => XML

c2b2a2

c3b3a3

c1b1a1

CBA

R Rtuple

A a1 /AB b1 /BC c1 /C

/tupletuple

A a2 /AB b2 /BC c2 /C

/tuple …

/R

R

tuple

A B Ca1 b1 c1

tuple

A B Ca2 b2 c2

tuple

A B Ca3 b3 c3

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 16

Tag Names & Nesting => XML DTDs (Grammars)

<!ELEMENT bibliography paper*><!ELEMENT paper (authors,fullPaper?,title,booktitle)><!ELEMENT authors author+>

XML DTD

bibliography paper* paper authors fullPaper? title booktitle authors author+

Grammar Rules

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 17

XML DTDs vs. XML Schema

• XML DTDs– set of allowed tag names

– their nesting structure (via grammar rules)

• XML Schema– tag names and nesting structure

– user-defined complex data types

– subtyping (no multiple inheritance): RESTRICT and EXTEND

– separate “namespace” for type names and tag (=element) names

– ...

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 18

XML Schema: User-Defined Type/Class Hierarchy

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 19

XML Schema Declarations (“home-style” syntax)

Complex Type Declarations

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 20

XML Schema (“home-style”)

Complex Types

Simple Type Declarations

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 21

XML Schema: Substitution Groups

Elements of a substitution group (hexagons) and associated complex types (boxes)

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 22

XML Schema Declarations (W3C syntax)

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 23

XML Query Languages

• XPath:

– root//books/book[cover_style=“paperback”][price<80]

• XQuery– the W3C XML query language

• XSLT – XML transformations (XML=>HTML, XML=>XML)

• ...

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 24

Transforming and Rendering XML: XSLT

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 25

XMAS: XML Matching And Structuring language

Integrated View Definition:“Find books from amazon.com

and DBLP, join on author,group by authors and title”

CONSTRUCT <books> <book>

$a1$t<pubs>

$p { $p } </pubs>

</book> { $a1, $t } </books>WHERE <books.book>

$a1 : <author />$t : <title />

</> IN "amazon.com" AND <authors.author>

$a2 : <author /><pubs> $p : <pub/> </>

</> IN "www...DBLP… "AND value( $a1 ) = value( $a2 )

CONSTRUCT <books> <book>

$a1$t<pubs>

$p { $p } </pubs>

</book> { $a1, $t } </books>WHERE <books.book>

$a1 : <author />$t : <title />

</> IN "amazon.com" AND <authors.author>

$a2 : <author /><pubs> $p : <pub/> </>

</> IN "www...DBLP… "AND value( $a1 ) = value( $a2 )

XMAS

XMAS Algebra

26

Database Mediation Theory Primer

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 27

Mediator Query Processing

TranslatorTranslator

Rewriter/OptimizerRewriter/Optimizer

composed plan

optimized plan

Query Q

Composition (Q o V)Composition (Q o V)

Integrated ViewDefinition V

parsed plan

Plan Execution Plan Execution

Compile-timeCompile-time

Run-timeRun-time

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 28

Logic View Definitions (Global-as-View) or

Querying and Reasoning with the Family ...

• Warm up: Who says this?– “Your are my son, but I’m not your father!”

• The mother!

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 29

Logic View Definitions (Global-as-View)• Globals-as-View (GAV)

– Integrated view V is defined in terms of the sources Src_1, ... , Src_k

• Given the following source databases:– Src_1 schema = { father(Father,Child), mother(Mother,Child) }

– Src_2 schema = { spouse(Spouse, Spouse) }

– Src_3 schema = { male(Person), female(Person) }

• Can you define integrated views V for ... ?– parent(Parent,Child)

• short: parent/2, i.e., table/relation name is ‘parent’, arity (#columns) is 2

– son/2, daughter/2

– brother/2, sister/2

– brother_in_law/2, sister_in_law/2

– aunt/2, uncle/2

– married/2, bachelor/2

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 30

Logic View Definitions (Global-as-View)Source relations: father/2, mother/2, spouse/2, male/1, female/1

= “,” = conjunction (and) = “ ; ” = disjunction (or)

= “not” = negation • parent(C,P) father(C,P) ; mother(C,P) . • son(P,S) parent(S,P) , male(S) .• brother(X,B) parent(X,P), son(P,B), X B .• brother_in_law(X,B) sister(X, Z), spouse(Z, B) ; spouse(X, Z), brother(Z, B) .

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 31

Logic View Definitions (Global-as-View)Source relations: father/2, mother/2, spouse/2, male/1, female/1

= “,” = conjunction (and) = “ ; ” = disjunction (or)

= “not” = negation • uncle(X, U) parent(X, Z), brother(Z, U) ; parent(X, Z), brother_in_law(Z, U) .• aunt(X, A) parent(X, Z), sister(Z, A) ; parent(X, Z), sister_in_law(Z, A) .• married(X) spouse(X, _) .• bachelor(X) [person(X)] , not married(X) .

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 32

Query Rewriting and Query Evaluation

• Query Rewriting:- Given a user query Q in terms of virtual views V...

- Find an equivalent query Q’ in terms of the sources Src_1,...,Src_k

• Query Evaluation:- Given a query Q’, evaluate Q’ over the source databases

D := Src_1 ... Src_k

• Examples:– Q_uncle/2 = { (X,Y) | uncle(X,Y) holds in D }

– Q_tom’s_uncle/1 = { X | uncle(tom, X) holds in D }

– Q_whose_uncle_is_tom/1 = { X | uncle(X, tom) holds in D }

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 33

Query Rewriting (for GAV)

• Query rewriting:- Given a user query Q in terms of virtual views V...

- Find an equivalent query Q’ in terms of the sources Src_1,...,Src_k

• Query Q, views V, source schemas S

• View unfolding:– starting with Q, repeatedly replace view predicates by the definition

• Creating a feasible plan:– here: compute disjunctive normal form (DNF)

– DNF = disjunction of conjunctions (= “union of joins”)

– order goals within each conjunction according to sources’ query capabilities

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 34

Example• ?- plan(brother(X0,X1)) .

brother(X0, X1)

== LQP ==>

(father(X0, X2) v mother(X0, X2))

& (father(X1, X2) v mother(X1, X2)) & male(X1) & neq(X0, X1)

brother(X0, X1)

==NNF LQP==>

(father(X0, X2) v mother(X0, X2))

& (father(X1, X2) v mother(X1, X2)) & male(X1) & neq(X0, X1)

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 35

Example (Cont’d)• ?- plan(brother(X0,X1)) .

brother(X0, X1)

==DNF LQP==>

father(X0, X2)&father(X1, X2)&male(X1)&neq(X0, X1)

v mother(X0, X2)&father(X1, X2)&male(X1)&neq(X0, X1)

v father(X0, X2)&mother(X1, X2)&male(X1)&neq(X0, X1)

v mother(X0, X2)&mother(X1, X2)&male(X1)&neq(X0, X1)

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 36

Example (Cont’d)• ?- plan(brother(X0,X1)) .

brother(X0, X1)

==Bp ordered LQP==>

parentDb(father(X1, X2) & father(X0, X2))

& genderDb(male(X1)) & mediator(neq(X0, X1))

v parentDb(father(X1, X2) & mother(X0, X2))

& genderDb(male(X1)) & mediator(neq(X0, X1))

v parentDb(mother(X1, X2)&father(X0,X2))

& genderDb(male(X1)) & z_mediator(neq(X0, X1))

v parentDb(mother(X1, X2)&mother(X0, X2))

& genderDb(male(X1))&z_mediator(neq(X0, X1))

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 37

Computing Feasible Plans (Goal Ordering)• A conjunctive query Q is an expression of the

form– q( X ) p1( X1 ) , ..., pn( Xn )– order of subgoals p_i is irrelevant

• An ordered plan P is an expression of the form– q( X ) [p1( X1 ) , ..., pn( Xn )]– order of subgoals p_i is important

• Problem:– given Q, compute P which is feasible, i.e., observes the limited

query capabilities of sources– Here: binding patterns, i.e., predicates’ arguments can be

• “b” – bound • “f” – free • “_” – bound or free

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 38

A Simple Algorithm for Ordering Goals

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 39

Query Containment• A query Q1 is contained in Q2, denoted Q1 Q2

– if for all possible database instances, the set of answers to Q1 is contained in the set of answers to Q2.

• Q1 and Q2 are called equivalent– if Q1 Q2 and Q2 Q1.

• Query containment is undecidable for many languages, e.g., for the relational calculus (SQL).

• For conjunctive queries, the problem is NP-complete (and thus decidable) – Since query sizes tend to be “small” (in particular, when

compared to database sizes), query containment is still of use in practice (indeed, it is one of the most fundamental tools for logic-based query optimization).

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 40

Query Containment• Q1(Xs,Ys) is contained in Q2(Xs,Zs) iff

ALL Xs: (EXISTS Ys: Q1(Xs,Ys)) (EXISTS Zs: Q2(Xs,Zs))

• iff we can refute its negation• iff

NOT ALL Xs: (EXISTS Ys: Q1(Xs,Ys)) (EXISTS Zs: Q2(Xs,Zs)) |= []

• iffEXISTS Xs: (EXISTS Ys: Q1(Xs,Ys)) AND NOT (EXISTS Zs: Q2(Xs,Zs)) |= []

• iff– canonical_db(Q1) AND Q2(Xs,Zs) |= []

• create database from Q1, then run Q2 as a query...

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 41

Query Containment Algorithm (in Prolog)

• Applications: – query minimization (conjunctive query is minimal if not

conjunct can be dropped)– semantic query optimization

• Q denial • here: denial is an integrity constraint and states what must not hold• example: denial = false mother(X,M), father(Y,M)

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 42

Example

• 50% of the clauses of the executable plan are irrelevant ...

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 43

Mediator Demo

• Computer Science Challenges:– Given a query Q over virtual integrated database V, how to come up with Q’

over the source schemas? (cf. Garlic, DiscoveryLink, ...)• query rewriting of Q(V) into Q’(SRCs) using unfolding and normalization• computation of feasible orders (NP-complete!?) while minimizing number of

“chunks” sent to sources• semantic query optimization (reasoning over plans!); e.g. conjunctive query

containment is NP-complete [Chandra-Merlin-77]

• A Quick Demo of the current prototype: – Find 3D reconstructions of cells found in ‘cerebellar cortex’:

• ?- ccdbData('cerebellar cortex').• Join everything reachable along ‘cerebellar-cortex’.(has-a)* in UMLS • ....with concept markup in CCDB• ... retrieve (links to) results• ... also show on SmartAtlas tool

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 44

Mediator Demo

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 45

From XML-Based to Logic and Model-Based (“Semantic”) Mediation

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 46

What’s the Problem with XML & Complex Multiple-Worlds?

• XML is Syntax– DTDs talk about element nesting– XML Schema schemas give you data types – need anything else? => write comments!

• Domain Semantics is complex:– implicit assumptions, hidden semantics sources seem unrelated to the non-expert

• Need Structure and Semantics beyond XML trees! employ richer OO models make domain semantics and “glue knowledge” explicit use ontologies to fix terminology and conceptualization avoid ambiguities by using formal semantics

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 47

From XML-Based to Model-Based Mediation• Data and Knowledge Sharing Potential:

Database Mediation + Knowledge Representation

________________________

= Model-Based Mediation

• Basic Ideas:– turn primary data sources into knowledge sources– employ secondary glue knowledge sources

• generic: UMLS, ... • specific: community/laboratory ontologies

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 48

DB mediation techniques

OntologiesKR formalisms

Model-Based Mediation

Information Integration Landscape

conceptual distanceone-world multiple-worlds

conceptual complexity/depth

low

high

addallbook-buyer

BLAST

EcoCyc

Cyc

WordNet

GO

home-buyer24x7 consumer

UMLS

MIA Entrez

RiboWeb

Tambis

BioinformaticsGeoinformatics

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 49

Knowledge Representation:Relating Theory to the World via Formal Models

John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations

“All models are wrong, but some are useful!”

XML-Based vs. Model-Based Mediation

Raw DataRaw DataRaw Data

IF THEN IF THEN IF THEN

LogicalDomainConstraints

Integrated-CM :=

CM-QL(Src1-CM,...)

Integrated-CM :=

CM-QL(Src1-CM,...)

. . ....

....

........ (XML)Objects

Conceptual Models

XMLElements

XML Models

C2 C3

C1

R

Classes,Relations,is-a, has-a, ...

Glue Maps

DMs, PMs

Glue Maps

DMs, PMs

Integrated-DTD :=

XML-QL(Src1-DTD,...)

Integrated-DTD :=

XML-QL(Src1-DTD,...)

No DomainConstraints

A = (B*|C),DB = ...

Structural Constraints (DTDs),Parent, Child, Sibling, ...

CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …} CM-QL ~ {F-Logic, DAML+OIL, …}

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 51

What’s the Glue? What’s in a Link?

• Syntactic Joins (X,Y) := X.SSN = Y.SSN equality (X,Y) := X.UMLS-ID = Y.UID

• “Speciality” Joins (X,Y,Score) := BLAST(X,Y,Score) similarity

• Semantic/Rule-Based Joins (X,Y,C) :=

X isa C, Y isa C, BLAST(X,Y,S), S>0.8 homology, lub

(X,Y,[produces,B,increased_in]) :=

X produces B, B increased_in Y. rule-based

e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease

• Challenge: – compile semantic joins into efficient syntactic ones

XY

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 52

Model-Based Mediation Methodology ...

• Lift Sources to export CMs:

CM(S) = OM(S) + KB(S) + CON(S)

• Object Model OM(S):– complex objects (frames), class hierarchy, OO constraints

• Knowledge Base KB(S):– explicit representation of (“hidden”) source semantics

– logic rules over OM(S)

• Contextualization CON(S):– situate OM(S) data using “glue maps” (GMs): domain maps DMs (ontology)

= terminological knowledge: concepts + roles process maps PMs

= “procedural knowledge”: states + transitions

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 53

... Model-Based Mediation Methodology

• Integrated View Definition (IVD)– declarative (logic) rules with object-oriented features

– defined over CM(S), domain maps, process maps

– needs “mediation engineers” = domain + KRDB experts

• Knowledge-Based Querying and Browsing (runtime):– mediator composes the user query Q with the IVD

... rewrites (Q o IVD), sends subqueries to sources

... post-processes returned results (e.g., situate in context)

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 54

S1 S2

S3

(XML-Wrapper) (XML-Wrapper) (XML-Wrapper)

CM-Wrapper CM-Wrapper CM-Wrapper

USER/ClientUSER/Client

CM (Integrated View)

MediatorEngine

FL rule proc.

LP rule proc.

Graph proc.XSB Engine

CM(S) =OM(S)+KB(S)+CON(S)

GCM

CM S1

GCM

CM S2

GCM

CM S3

CM Queries & Results (exchanged in XML)

Domain MapsDMs

Domain MapsDMs

Domain MapsDMs

Domain MapsDMs

Domain MapsDMs

Process MapsPMs

“Glue” MapsGMs

semanticcontextCON(S)

Integrated View Definition IVD

Model-Based Mediator Architecture

First results & Demos:KIND prototype, formal

DM semantics, PMs[SSDBM00] [VLDB00][ICDE01] [NIH-HB01]

(w/ Gupta, Martone)

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 55

Domain Maps (Ontologies) as Glue Knowledge Sources

• Domain Map = Ontology– representation of terminological knowledge

• Use in Model-Based Mediation– (derived) concepts as “drop points”, “anchor points”, “context”

for source classes

– compile-time use: view definition, subsumption, classification,...

– runtime use: querying/deduction, path queries, ....

• Formalisms:– Semantic nets, Thesauri, Frame-logic, Description logics, ...

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 56

Ontologies• So what is an Ontology?

– definition of things that are relevant to your application– representation of terminological knowledge (“TBox”)– explicit specification of a conceptualization– concept hierarchy (“is-a”)– further semantic relationships between concepts– abstractions of relational schemas, (E)ER, UML classes, XML

Schemas

• Examples:– NCMIR ANATOM– GO (Gene Ontology)– UMLS (Unified Medical Language System– CYC

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 57

Formalism for Ontologies: Description Logic

• DL definition of “Happy Father” (Example from Ian Horrocks, U Manchester, UK)

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 58

Description Logic Statements as Rules

• In first-order logic (rule form):happyFather(X)

man(X), child(X,C1), child(X,C2), blue(C1), green(C2),

not ( child(X,C3), poorunhappyChild(C3) ).

poorunhappyChild(C)

not rich(C), not happy(C).

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 59

Description Logics

• Terminological Knowledge (TBox)– Concept Definition (naming of concepts):

– Axiom (constraining of concepts):

=> a mediators “glue knowledge source”

• Assertional Knowledge (ABox)– the marked neuron in image 27

=> the concrete instances/individuals of the concepts/classes that your sources export

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 60

Querying vs. Reasoning

• Querying: – given a DB instance I (= logic interpretation), evaluate a query

expression (e.g. SQL, FO formula, Prolog program, ...)– boolean query: check if I |= (i.e., if I is a model of ) – (ternary) query: { (X, Y, Z) | I |= (X,Y,Z) } => check happyFathers in a given database

• Reasoning:– check if I |= implies I |= for all databases I, – i.e., if => – undecidable for FO, F-logic, etc.– Descriptions Logics are decidable fragments concept subsumption, concept hierarchy, classification semantic tableaux, resolution, specialized algorithms

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 61

What’s in an Answer?(What’s in a Link? revisited)

• Semantic/Rule-Based Joins

(X,Y,[produces,B,increased_in]) :=

X produces B, B increased_in Y. rule-based

e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease

• What is the Erdoes number of person P? – 3

• Really? Why?– authority based: <VIP> said so– faith based: don’t know but firmly believe– query statement Q = ... derived it from DB I– query Q = ... derived it from DB I and KB T using derivation D=> logic-based systems often “come with explanations”

(“computations as proofs”)

XY

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 62

Formalizing Glue Knowledge:Domain Map for SYNAPSE and NCMIR

Domain Map = labeled graph with concepts ("classes") and roles ("associations")• additional semantics: expressed as logic rules (F-logic)

Domain Map = labeled graph with concepts ("classes") and roles ("associations")• additional semantics: expressed as logic rules (F-logic)

Domain Map (DM)

Purkinje cells and Pyramidal cells have dendritesthat have higher-order branches that contain spines.Dendritic spines are ion (calcium) regulating components.Spines have ion binding proteins. Neurotransmissioninvolves ionic activity (release). Ion-binding proteinscontrol ion activity (propagation) in a cell. Ion-regulatingcomponents of cells affect ionic activity (release).

Domain Expert Knowledge

DM in Description Logic

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 63

Source Contextualization & DM Refinement

In addition to registering (“hanging off”) data relative toexisting concepts, a source may also refine the mediator’s domain map...

sources can register new concepts at the mediator ...

Example:ANATOM Domain Map

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 65

Browsing Registered Data with Domain Maps

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 66

Process Maps with Abstractions and Elaborations: From Terminological to Procedural Glue

• nodes ~ states• edges ~ processes, transitions• blue/red edges:

• processes in Src1/Src2• general form of edges:

related formalisms

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 67

Summary: Mediation Scenarios & TechniquesFederated Databases XML-Based Mediation Model-Based Mediation

One-World One-/Multiple-Worlds Complex Multiple-Worlds

Common Schema Mediated Schema Common Glue Maps

SQL, rules XML query languages DOOD query languages

Schema Transformations Syntax-Aware Mappings Semantics-Aware Mappings

Syntactic Joins Syntactic Joins “Semantic” Joins via Glue Maps

DB expert DB expert KRDB + domain expert

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 68

Semantic (Community) Webs “Within the next decade, computing

technology will transform the Internet into the Interspace, an information

infrastructure that supports semantics indexing and concept navigation across

widely distributed community repositories.”

Bruce Schatz, IEEE Computer, Jan. 2002

"The Semantic Web is an extension of the current web in which information is given

well-defined meaning, better enabling computers and people to work in

cooperation."Tim Berners-Lee et al., 2001

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 69

Combine Everything:Die eierlegende Wollmilchsau:

• Database Federation/Mediation– query rewriting under GAV/LAV – w/ binding pattern constraints– distributed query processing

• Semantic Mediation– semantic integrity constraints, reasoning w/ plans, automated

deduction– deductive database/logic programming technology, AI “stuff”...– Semantic Web technology

• Scientific Workflow Management– more procedural than database mediation (often the scientist is

the query planner)– deployment using web services

70

B R E A K

... followed by demos ...

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 71

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 72

GEON SMART Metadata: Multihierarchical Rock Classification for “Thematic Queries” (GSC)

Composition

Genesis

Fabric

Texture

“smart discovery & querying” via multiple, independent concept hierarchies (controlled vocabularies)• data at different description levels can be found and processed

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 73

GEON SMART Metadata:Multihierarchical Rock Classification for “Thematic Queries”

http://klin-pc.sdsc.edu:8080/examples/jsp/geon/composition.jsp

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 74

GEON Ontology Demo

• http://klin-pc.sdsc.edu:8080/examples/jsp/geon/old-rock.jsp • http://klin-pc.sdsc.edu:8080/examples/jsp/geon/rock.jsp

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 75

Architecture of Ontology Based Map Integration

Ontology Mapping

Web Map Server Web Map Server Web Map Server

Database Database Database

Global Web Map Server

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 76

DOE Scientific Datamanagement Center

• Scientific Workflow Demo

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 77

Microarrayanalysis cDNA Cluster

Database search for promoter identification

Database search

A B C

Promoter model Common promoter alignment Promoter sequences

*- New candidate target genes

*

*

*

*

*

Adapted from Thomas Werner Biomolecular Engineering, 17: 87-94 (2001)

Example: A Scientific Workflow

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 78

Compute clusters(min. distance)

Select gene-set(cluster-level)

For each geneRetrieve

Transcription factors

ArrangeTranscription factors

For each promoter

ComputeSubsequence labels

With all Promoter Models

Compute JointPromoter Model

Retrieve matching cDNA

Retrieve genomicSequence

Extract promoterRegion(begin, end)

Create consensussequence

Align promoters

Conceptual Workflow

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 79

Mapping This Workflow To Web Sites

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 80

Customized CGI Application

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 81

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 82

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 83

ClustalW OutputTransfac Query Results

84Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure

AWF EWF

Design Execution monitoring

WF-Pilot

Abstract Task(AT) Repository

AAVrules C

C

C

Data & Parameter Ontologies

ETschemas

User

ET ET

Genbank BLAST

query rewriting

web serviceinvocation

Executable Task(ET) Repository

web serviceinvocation

semantic typechecking

conversionrules

data typeconversion

Datatype & Conversion Repository

web servicematching

WF-Compiler

AWF EWF Translation

WF-Engine

Scheduling and execution

SDM-SciDAC System Architecture

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 85

AWF to EWF

GetGenomicSequence (+{selectedGene}, -{{GenomicSequence}}) :-

GENBANK (+{selectedGene}, -{cDNASequence}),

BLAST (+{cDNASequence}, +dbName, +format, -{rankedGenomicSequenceList}).

GetGenomicSequence (+{selectedGene}, -{{GenomicSequence}}) :-

GENBANK (+{selectedGene}, -{cDNASequence}),

BLAT (+{cDNASequence}, +QueryType, +SortCriteria, +OutputType , -{rankedGenomicSequenceList}).

IdentifyPromoterElements (+{rankedGenomicSequenceList}, -{element}) :- PromoterSequences (+{rankedGenomicSequenceList}, getBeginEnd(+Species, -Begin, -End), -{element}).

For each gene

Retrieve matching cDNA

Retrieve genomicSequence

Extract promoterRegion(begin, end)

Same functionality, different operational constraints andavailability

Need extradomain knowledge

User supplied

Translation to EWF needscreation of iterators

Declarative specification

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 86

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 87

Abstract Task (AT) Registration

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 88

Abstract Task (AT) View and Delete

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 89

Abstract Task (AT) Update

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 90

AWF Design

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 91

EWF Planning and Compilation

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 92

EWF Execution

93

BIRN Tools Demo

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 94

Some References (starting points)• XML

– General: http://xml.coverpages.org/xml.html – XQuery: http://www.w3.org/XML/Query – XSLT: http://xml.coverpages.org/xsl.html

• Query Rewriting:– database research literature

• Logic Programming– Learn Prolog Now! http://www.coli.uni-sb.de/~kris/learn-prolog-now/ – SWI-Prolog (nice free Prolog system): http://www.swi-prolog.org/

• Ontologies– Ontology Web language: http://www.w3.org/TR/owl-features/ – http://www-ksl.stanford.edu/kst/what-is-an-ontology.html – http://www.cs.utexas.edu/users/mfkb/related.html

• Model-Based Mediation:– http://www.sdsc.edu/~ludaesch/Paper/icde01.html

• Semantic Web:– http://www.w3.org/2001/sw/

Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 95

References: Project Web Sites

• GEOsciences Network (NSF) – www.geongrid.org

• Biomedical Informatics Research Network (NIH)– www.nbirn.net

• Science Environment for Ecological Knowledge (NSF)– seek.ecoinformatics.org

• Scientific Data Management Center (DOE)– sdm.lbl.gov/sdmcenter/