Tutorial #5: Scientific Data Integration and Mediation
San Diego Supercomputer Center
U.C. San Diego
Bertram Ludäscher
Ilkay Altintas
Amarnath Gupta
Kai Lin
Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 2
Acknowledgements
• National Science Foundation (NSF) – www.nsf.gov
• GEOsciences Network (NSF) – www.geongrid.org
• Biomedical Informatics Research Network (NIH) – www.nbirn.net
• Science Environment for Ecological Knowledge (NSF) – seek.ecoinformatics.org
• Scientific Data Management Center (DOE) – sdm.lbl.gov/sdmcenter/
Outline
• 8:30 – 10:30am: Tutorial: Data Integration & Mediation
  – Introduction to database mediation:
    • motivation and architecture
    • XML-based data integration
  – Database mediation theory primer:
    • logic view definitions, view unfolding, computing feasible plans
  – From XML-based to Knowledge-based mediation:
    • use of ontologies in data integration, ...
• 10:30 – 10:45am: BREAK
• 10:45 – 12:00: Applications and Demos
  – 10:45 – 11:05 Mediator Demo
  – 11:05 – 11:20 Queries w/ Ontology Support
  – 11:20 – 11:40 Scientific Workflows
  – 11:40 – 12:00 KNOW-ME Ontology Tool
Information Integration Challenges
• System aspects: “Grid” Middleware
  – distributed data & computing
  – Web Services, WSDL/SOAP, …
  – sources = functions, files, databases, …
• Syntax & Structure: XML-Based Mediators
  – wrapping, restructuring
  – XML queries and views
  – sources = XML databases
• Semantics: Model-Based/Semantic Mediators
  – conceptual models and declarative views
  – Semantic Web/Knowledge Grid stuff: ontologies, description logics (RDF(S), DAML+OIL, OWL, ...)
  – sources = knowledge bases (DB+CMs+ICs)
Goal: reconciling S^4 heterogeneities (Syntax, Structure, Semantics, System aspects)
  – “gluing” together multiple data sources
  – bridging information and knowledge gaps computationally
Information Integration from a DB Perspective
• Information Integration Problem
  – Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
  – Find: the answers to Q1, ..., Qn
• The Database Perspective:
  – source = “database”: Si has a schema (relational, XML, OO, ...)
  – Si can be queried
  – define virtual (or materialized) integrated views V over S1, ..., Sk using database query languages (SQL, XQuery, ...)
  – questions become queries Qi against V(S1, ..., Sk)
Standard (XML-Based) Mediator Architecture
[Figure: mediator architecture, top to bottom]
• USER/Client: poses Query Q ( G (S1, ..., Sk) ); XML queries & results are exchanged with the mediator
• MEDIATOR: provides the Integrated Global (XML) View G, given by the Integrated View Definition G(..) ← S1(..), …, Sk(..)
• Wrappers: one per source, each exporting an (XML) view; wrappers implemented as web services
• Sources: S1, S2, ..., Sk
Some BIRNing Data Integration Questions
• Data Integration Approaches:
  – Let’s just share data, e.g., link everything from a web page!
  – ... or better put everything into a relational or XML database
  – ... and do remote access using the Grid
  – ... or just use Web services!
• Nice try. But:
  – “Find the files where the amygdala was segmented.”
  – “Which other structures were segmented in the same files?”
  – “Did the volume of any of those structures differ much from normal?”
  – “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”
Biomedical Informatics Research Network, http://nbirn.net
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
[Figure: “One-World” Mediation. Information integration over the sites shown: addall.com, amazon.com, A1books.com, half.com, barnes&noble.com, and a WWW public library.]
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
[Figure: “Multiple-Worlds” Mediation. Information integration over the sources shown: Realtor, Demographics, School Rankings, Crime Stats.]
A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures?
[Figure: “Complex Multiple-Worlds” Mediation. Information integration over the sources shown: Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB).]
A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
[Figure: “Complex Multiple-Worlds” Mediation. Information integration over the sources shown: protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE).]
Biomedical Informatics Research Network, http://nbirn.net
Structural / XML-Based Mediation
Abstract XML-Based Mediator Architecture
[Figure: abstract mediator architecture, top to bottom]
• USER/Client: poses Query Q o V (S_1, ..., S_k); XML queries & results are exchanged with the mediator
• MEDIATOR: provides the Integrated XML View V, given by the Integrated View Definition IVD(S_1, ..., S_k)
• Wrappers: one per source, each exporting an XML view
• Sources: S_1, S_2, ..., S_k
Extensible Markup Language (XML)
• (meta)language for marking up text & data with user-definable tags
  – (X)HTML, XSLT, XML Schema, ...
  – MathML, BioML, GeoML, NeuroML, ...
  – XML-RPC, SOAP, ...
• semistructured tree data model
  – flexible: marked-up text, web pages, databases, ...
• container model:
  – “boxes within boxes”
Example, from plain text to progressively richer markup:
  ... in their wonderful book called SemWeb Tractat by B. Schatz and T.B. Lee, the authors show how ...
  ... in their wonderful book called <title>SemWeb Tractat</title> by B. Schatz and T.B. Lee, the authors show how ...
  ... in their wonderful book called <title>SemWeb Tractat</title> by <author>B. Schatz</author> and <author>T.B. Lee</author>, the authors show how ...
The same information as data:
  <book> <title>SemWeb Tractat</title> <author>B. Schatz</author> <author>T.B. Lee</author> </book>
Tree view: book has the children title (“SemWeb Tractat”), author (“B. Schatz”), and author (“T.B. Lee”).
Example: Relational Data => XML
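Relation R:
  A   B   C
  a1  b1  c1
  a2  b2  c2
  a3  b3  c3
XML serialization:
  <R>
    <tuple> <A>a1</A> <B>b1</B> <C>c1</C> </tuple>
    <tuple> <A>a2</A> <B>b2</B> <C>c2</C> </tuple>
    …
  </R>
Tree view: root R with one tuple child per row; each tuple has the children A, B, C holding that row's values (a1 b1 c1; a2 b2 c2; a3 b3 c3).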
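The relation-to-XML wrapping above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's wrapper code; the helper name relation_to_xml is hypothetical, and it assumes column names are valid XML tag names.

```python
import xml.etree.ElementTree as ET

def relation_to_xml(name, columns, rows):
    """Wrap a relation as <name><tuple><col>value</col>...</tuple>...</name>."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, value in zip(columns, row):
            # One child element per column, holding the cell value as text
            ET.SubElement(tup, col).text = str(value)
    return root

R = relation_to_xml(
    "R",
    ["A", "B", "C"],
    [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")],
)
xml_text = ET.tostring(R, encoding="unicode")
print(xml_text)
```

A real wrapper would also have to escape or rename column names that are not legal element names.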
Tag Names & Nesting => XML DTDs (Grammars)
XML DTD:
  <!ELEMENT bibliography (paper*)>
  <!ELEMENT paper (authors,fullPaper?,title,booktitle)>
  <!ELEMENT authors (author+)>
Grammar Rules:
  bibliography → paper*
  paper → authors fullPaper? title booktitle
  authors → author+
XML DTDs vs. XML Schema
• XML DTDs
  – set of allowed tag names
  – their nesting structure (via grammar rules)
• XML Schema
  – tag names and nesting structure
  – user-defined complex data types
  – subtyping (no multiple inheritance): RESTRICT and EXTEND
  – separate “namespace” for type names and tag (=element) names
  – ...
XML Schema: User-Defined Type/Class Hierarchy
XML Schema Declarations (“home-style” syntax)
Complex Type Declarations
XML Schema (“home-style”)
Complex Types
Simple Type Declarations
XML Schema: Substitution Groups
Elements of a substitution group (hexagons) and associated complex types (boxes)
XML Schema Declarations (W3C syntax)
XML Query Languages
• XPath:
  – root//books/book[cover_style="paperback"][price<80]
• XQuery
  – the W3C XML query language
• XSLT
  – XML transformations (XML=>HTML, XML=>XML)
• ...
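To make the XPath example concrete, here is a small sketch using Python's standard xml.etree.ElementTree. The sample document is invented for illustration. ElementTree implements only a subset of XPath: the [cover_style='paperback'] predicate is supported, but numeric comparisons such as [price<80] are not, so that filter is applied in Python.

```python
import xml.etree.ElementTree as ET

# Invented sample data; element names follow the XPath expression on the slide
doc = ET.fromstring("""
<root>
  <books>
    <book><title>T1</title><cover_style>paperback</cover_style><price>25</price></book>
    <book><title>T2</title><cover_style>hardcover</cover_style><price>60</price></book>
    <book><title>T3</title><cover_style>paperback</cover_style><price>95</price></book>
  </books>
</root>
""")

# Step 1: path plus the equality predicate, which ElementTree understands
paperbacks = doc.findall(".//books/book[cover_style='paperback']")

# Step 2: the numeric predicate [price<80], evaluated in Python
cheap = [b for b in paperbacks if float(b.findtext("price")) < 80]
titles = [b.findtext("title") for b in cheap]
print(titles)
```

A full XPath engine would evaluate both predicates inside the path expression itself.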
Transforming and Rendering XML: XSLT
XMAS: XML Matching And Structuring language
Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”

CONSTRUCT <books>
            <book>
              $a1
              $t
              <pubs> $p { $p } </pubs>
            </book> { $a1, $t }
          </books>
WHERE <books.book>
        $a1 : <author />
        $t : <title />
      </> IN "amazon.com"
  AND <authors.author>
        $a2 : <author />
        <pubs> $p : <pub/> </>
      </> IN "www...DBLP… "
  AND value( $a1 ) = value( $a2 )

[Figure: the XMAS query and its translation into the XMAS Algebra]
Mediator Query Processing
[Figure: mediator query processing pipeline]
• Compile-time:
  – inputs: Query Q and Integrated View Definition V
  – Translator => parsed plan
  – Composition (Q o V) => composed plan
  – Rewriter/Optimizer => optimized plan
• Run-time:
  – Plan Execution
Logic View Definitions (Global-as-View) or Querying and Reasoning with the Family ...
• Warm up: Who says this?
  – “You are my son, but I’m not your father!”
• The mother!
Logic View Definitions (Global-as-View)
• Global-as-View (GAV)
  – Integrated view V is defined in terms of the sources Src_1, ..., Src_k
• Given the following source databases:
  – Src_1 schema = { father(Father,Child), mother(Mother,Child) }
  – Src_2 schema = { spouse(Spouse, Spouse) }
  – Src_3 schema = { male(Person), female(Person) }
• Can you define integrated views V for ... ?
  – parent(Parent,Child)
    • short: parent/2, i.e., table/relation name is ‘parent’, arity (#columns) is 2
  – son/2, daughter/2
  – brother/2, sister/2
  – brother_in_law/2, sister_in_law/2
  – aunt/2, uncle/2
  – married/2, bachelor/2
Logic View Definitions (Global-as-View)
Source relations: father/2, mother/2, spouse/2, male/1, female/1
Notation: ∧ = “,” = conjunction (and); ∨ = “;” = disjunction (or); ¬ = “not” = negation
• parent(C,P) ← father(C,P) ; mother(C,P) .
• son(P,S) ← parent(S,P) , male(S) .
• brother(X,B) ← parent(X,P), son(P,B), X ≠ B .
• brother_in_law(X,B) ← sister(X,Z), spouse(Z,B) ; spouse(X,Z), brother(Z,B) .
Logic View Definitions (Global-as-View), continued
Source relations: father/2, mother/2, spouse/2, male/1, female/1
• uncle(X,U) ← parent(X,Z), brother(Z,U) ; parent(X,Z), brother_in_law(Z,U) .
• aunt(X,A) ← parent(X,Z), sister(Z,A) ; parent(X,Z), sister_in_law(Z,A) .
• married(X) ← spouse(X,_) .
• bachelor(X) ← [person(X)] , not married(X) .
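The view definitions can be tried out directly on a tiny database. The sketch below evaluates a few of the rules as Python set comprehensions. The facts are invented, and the argument order follows the rules as written, i.e., father(C,P) means P is the father of C. Only the first disjunct of the uncle rule is implemented, since the sample data has no spouse facts.

```python
# Invented source facts; father(C, P) reads "P is the father of C"
father = {("tom", "bob"), ("ann", "bob"), ("bob", "carl"), ("dave", "carl")}
mother = {("tom", "sue"), ("ann", "sue")}
male = {"tom", "bob", "carl", "dave"}

def parent():
    # parent(C,P) <- father(C,P) ; mother(C,P)
    return father | mother

def son():
    # son(P,S) <- parent(S,P), male(S)
    return {(p, s) for (s, p) in parent() if s in male}

def brother():
    # brother(X,B) <- parent(X,P), son(P,B), X != B
    return {(x, b) for (x, p) in parent() for (p2, b) in son()
            if p == p2 and x != b}

def uncle():
    # uncle(X,U) <- parent(X,Z), brother(Z,U)   (first disjunct only)
    return {(x, u) for (x, z) in parent() for (z2, u) in brother() if z == z2}

print(sorted(brother()))
print(sorted(uncle()))
```

Here the views are materialized bottom-up; a mediator would instead unfold a query over these views into a query over the sources.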
Query Rewriting and Query Evaluation
• Query Rewriting:
  – Given a user query Q in terms of virtual views V ...
  – Find an equivalent query Q’ in terms of the sources Src_1, ..., Src_k
• Query Evaluation:
  – Given a query Q’, evaluate Q’ over the source databases D := Src_1 ∪ ... ∪ Src_k
• Examples:
  – Q_uncle/2 = { (X,Y) | uncle(X,Y) holds in D }
  – Q_tom’s_uncle/1 = { X | uncle(tom, X) holds in D }
  – Q_whose_uncle_is_tom/1 = { X | uncle(X, tom) holds in D }
Query Rewriting (for GAV)
• Query rewriting:
  – Given a user query Q in terms of virtual views V ...
  – Find an equivalent query Q’ in terms of the sources Src_1, ..., Src_k
• Query Q, views V, source schemas S
• View unfolding:
  – starting with Q, repeatedly replace view predicates by their definitions
• Creating a feasible plan:
  – here: compute disjunctive normal form (DNF)
  – DNF = disjunction of conjunctions (= “union of joins”)
  – order goals within each conjunction according to sources’ query capabilities
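View unfolding followed by DNF conversion can be sketched as follows. This is an illustrative Python toy, not the mediator's planner: formulas are nested tuples, the VIEWS table encodes three of the family rules, and fresh variable names (X2, X3, ...) are generated by a hypothetical helper fresh().

```python
import itertools

_counter = itertools.count(2)

def fresh():
    """Return a new variable name X2, X3, ... for body-only variables."""
    return f"X{next(_counter)}"

def atom(pred, *args):
    return ("atom", pred, args)

def AND(*fs):
    return ("and", fs)

def OR(*fs):
    return ("or", fs)

def _brother(x, b):
    p = fresh()  # the shared parent is existentially quantified
    return AND(atom("parent", x, p), atom("son", p, b), atom("neq", x, b))

# GAV view definitions: view name -> function building the rule body
VIEWS = {
    "parent": lambda c, p: OR(atom("father", c, p), atom("mother", c, p)),
    "son": lambda p, s: AND(atom("parent", s, p), atom("male", s)),
    "brother": _brother,
}

def unfold(f):
    """Repeatedly replace view predicates by their definitions."""
    if f[0] == "atom":
        _, pred, args = f
        return unfold(VIEWS[pred](*args)) if pred in VIEWS else f
    return (f[0], tuple(unfold(g) for g in f[1]))

def dnf(f):
    """Return a list of conjunctions (lists of atoms): a 'union of joins'."""
    if f[0] == "atom":
        return [[f]]
    if f[0] == "or":
        return [c for g in f[1] for c in dnf(g)]
    result = [[]]
    for g in f[1]:  # 'and': distribute over the disjuncts of each conjunct
        result = [c + d for c in result for d in dnf(g)]
    return result

plan = dnf(unfold(atom("brother", "X0", "X1")))
for conj in plan:
    print(" & ".join(f"{p}({', '.join(a)})" for _, p, a in conj))
```

For brother(X0,X1) this yields four conjunctions over father, mother, male, and neq. DNF conversion can blow up exponentially in general, which is acceptable here only because queries are small.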
Example
• ?- plan(brother(X0,X1)) .

  brother(X0, X1)
  == LQP ==>
  (father(X0, X2) v mother(X0, X2))
  & (father(X1, X2) v mother(X1, X2)) & male(X1) & neq(X0, X1)

  brother(X0, X1)
  == NNF LQP ==>
  (father(X0, X2) v mother(X0, X2))
  & (father(X1, X2) v mother(X1, X2)) & male(X1) & neq(X0, X1)
Example (Cont’d)
• ?- plan(brother(X0,X1)) .

  brother(X0, X1)
  == DNF LQP ==>
    father(X0, X2) & father(X1, X2) & male(X1) & neq(X0, X1)
  v mother(X0, X2) & father(X1, X2) & male(X1) & neq(X0, X1)
  v father(X0, X2) & mother(X1, X2) & male(X1) & neq(X0, X1)
  v mother(X0, X2) & mother(X1, X2) & male(X1) & neq(X0, X1)
Example (Cont’d)
• ?- plan(brother(X0,X1)) .

  brother(X0, X1)
  == Bp ordered LQP ==>
    parentDb(father(X1, X2) & father(X0, X2))
    & genderDb(male(X1)) & mediator(neq(X0, X1))
  v parentDb(father(X1, X2) & mother(X0, X2))
    & genderDb(male(X1)) & mediator(neq(X0, X1))
  v parentDb(mother(X1, X2) & father(X0, X2))
    & genderDb(male(X1)) & z_mediator(neq(X0, X1))
  v parentDb(mother(X1, X2) & mother(X0, X2))
    & genderDb(male(X1)) & z_mediator(neq(X0, X1))
Computing Feasible Plans (Goal Ordering)
• A conjunctive query Q is an expression of the form
  – q( X ) ← p1( X1 ) , ..., pn( Xn )
  – order of subgoals p_i is irrelevant
• An ordered plan P is an expression of the form
  – q( X ) ← [p1( X1 ) , ..., pn( Xn )]
  – order of subgoals p_i is important
• Problem:
  – given Q, compute P which is feasible, i.e., observes the limited query capabilities of sources
  – Here: binding patterns, i.e., predicates’ arguments can be
    • “b” – bound
    • “f” – free
    • “_” – bound or free
A Simple Algorithm for Ordering Goals
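One simple way to order goals, offered here as a sketch and not necessarily the algorithm presented in the tutorial, is greedy: repeatedly pick any remaining goal whose “b” positions are already bound, execute it, and mark all of its variables as bound. The function name order_goals and the sample binding patterns are assumptions made for illustration.

```python
def order_goals(goals, patterns, bound=()):
    """Greedily compute a feasible order for the goals of a conjunctive query.

    goals:    list of (pred, args), with variables given as strings
    patterns: pred -> list of allowed binding patterns, e.g. ["bf", "fb"],
              using 'b' (must be bound), 'f' (free), '_' (either)
    Returns an ordered list of goals, or None if no feasible order exists.
    """
    bound = set(bound)
    remaining = list(goals)
    plan = []
    while remaining:
        for goal in remaining:
            pred, args = goal
            callable_now = any(
                all(p != "b" or a in bound for p, a in zip(pattern, args))
                for pattern in patterns[pred]
            )
            if callable_now:
                plan.append(goal)
                bound.update(args)  # after the call, all arguments are bound
                remaining.remove(goal)
                break
        else:
            return None  # no remaining goal is callable: infeasible
    return plan

# Assumed capabilities: father can be scanned, male and neq need bound inputs
patterns = {"father": ["ff"], "male": ["b"], "neq": ["bb"]}
goals = [
    ("neq", ("X0", "X1")),
    ("male", ("X1",)),
    ("father", ("X0", "X2")),
    ("father", ("X1", "X2")),
]
ordered = order_goals(goals, patterns)
print(ordered)
```

Because calling a goal only ever adds bindings, a goal that is callable now stays callable later, so this greedy strategy finds a feasible order whenever one exists. It makes no attempt to find the cheapest one.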
Query Containment
• A query Q1 is contained in Q2, denoted Q1 ⊆ Q2
  – if for all possible database instances, the set of answers to Q1 is contained in the set of answers to Q2.
• Q1 and Q2 are called equivalent
  – if Q1 ⊆ Q2 and Q2 ⊆ Q1.
• Query containment is undecidable for many languages, e.g., for the relational calculus (SQL).
• For conjunctive queries, the problem is NP-complete (and thus decidable)
  – Since query sizes tend to be “small” (in particular, when compared to database sizes), query containment is still of use in practice (indeed, it is one of the most fundamental tools for logic-based query optimization).
Query Containment
• Q1(Xs,Ys) is contained in Q2(Xs,Zs) iff
  ALL Xs: (EXISTS Ys: Q1(Xs,Ys)) => (EXISTS Zs: Q2(Xs,Zs))
• iff we can refute its negation
• iff
  NOT ( ALL Xs: (EXISTS Ys: Q1(Xs,Ys)) => (EXISTS Zs: Q2(Xs,Zs)) ) |= []
• iff
  EXISTS Xs: (EXISTS Ys: Q1(Xs,Ys)) AND NOT (EXISTS Zs: Q2(Xs,Zs)) |= []
• iff
  – canonical_db(Q1) AND Q2(Xs,Zs) |= []
    • create database from Q1, then run Q2 as a query...
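The “create database from Q1, then run Q2” idea can be sketched for conjunctive queries without constants. The function below freezes Q1's variables into constants to form the canonical database, then searches by backtracking for a match of Q2's body that maps Q2's head onto Q1's frozen head. The representation (head variables plus a list of body atoms) and the name contained_in are assumptions for this illustration; queries are assumed safe, i.e., every head variable occurs in the body.

```python
def contained_in(q1, q2):
    """Test Q1 ⊆ Q2 for conjunctive queries given as (head_vars, body).

    body is a list of (pred, args); every argument is treated as a variable.
    """
    head1, body1 = q1
    head2, body2 = q2
    # Canonical database of Q1: each variable of Q1 is frozen into a constant
    db = {(pred, tuple(args)) for pred, args in body1}

    def extend(i, subst):
        # Evaluate Q2 on db: match body atom i, then recurse on the rest
        if i == len(body2):
            return tuple(subst[v] for v in head2) == tuple(head1)
        pred, args = body2[i]
        for fact_pred, fact_args in db:
            if fact_pred != pred or len(fact_args) != len(args):
                continue
            s = dict(subst)
            # Bind each variable consistently with earlier bindings
            if all(s.setdefault(a, c) == c for a, c in zip(args, fact_args)):
                if extend(i + 1, s):
                    return True
        return False

    return extend(0, {})

# Q1(X) <- father(X,P), male(X)        Q2(X) <- father(X,Y)
q1 = (("X",), [("father", ("X", "P")), ("male", ("X",))])
q2 = (("X",), [("father", ("X", "Y"))])
print(contained_in(q1, q2), contained_in(q2, q1))
```

The search is exponential in the size of Q2 in the worst case, in line with the NP-completeness of the problem.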
Query Containment Algorithm (in Prolog)
• Applications:
  – query minimization (a conjunctive query is minimal if no conjunct can be dropped)
  – semantic query optimization
    • Q denial
    • here: denial is an integrity constraint and states what must not hold
    • example: denial = false ← mother(X,M), father(Y,M)
Example
• 50% of the clauses of the executable plan are irrelevant ...
Mediator Demo
• Computer Science Challenges:
  – Given a query Q over virtual integrated database V, how to come up with Q’ over the source schemas? (cf. Garlic, DiscoveryLink, ...)
    • query rewriting of Q(V) into Q’(SRCs) using unfolding and normalization
    • computation of feasible orders (NP-complete!?) while minimizing number of “chunks” sent to sources
    • semantic query optimization (reasoning over plans!); e.g., conjunctive query containment is NP-complete [Chandra-Merlin-77]
• A Quick Demo of the current prototype:
  – Find 3D reconstructions of cells found in ‘cerebellar cortex’:
    • ?- ccdbData('cerebellar cortex').
    • Join everything reachable along ‘cerebellar-cortex’.(has-a)* in UMLS
    • ... with concept markup in CCDB
    • ... retrieve (links to) results
    • ... also show on SmartAtlas tool
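The path expression ‘cerebellar-cortex’.(has-a)* amounts to a reachability query over has-a edges, followed by a join with the data registered at the reached concepts. A minimal Python sketch, with an invented has-a fragment and invented result identifiers (neither is actual UMLS or CCDB content):

```python
def reachable(start, edges):
    """All concepts reachable from start via zero or more has-a edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for child in edges.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Hypothetical has-a fragment of a domain map
has_a = {
    "cerebellar cortex": ["purkinje cell layer", "granular layer"],
    "purkinje cell layer": ["purkinje cell"],
    "purkinje cell": ["dendrite"],
}
# Hypothetical data registered (marked up) with concepts
registered = {
    "purkinje cell": ["recon-17"],
    "dendrite": ["recon-42"],
    "hippocampus": ["recon-99"],
}

concepts = reachable("cerebellar cortex", has_a)
results = sorted(r for c in concepts for r in registered.get(c, []))
print(results)
```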
Mediator Demo
From XML-Based to Logic and Model-Based (“Semantic”) Mediation
What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
  – DTDs talk about element nesting
  – XML Schema schemas give you data types
  – need anything else? => write comments!
• Domain Semantics is complex:
  – implicit assumptions, hidden semantics
  => sources seem unrelated to the non-expert
• Need Structure and Semantics beyond XML trees!
  => employ richer OO models
  => make domain semantics and “glue knowledge” explicit
  => use ontologies to fix terminology and conceptualization
  => avoid ambiguities by using formal semantics
From XML-Based to Model-Based Mediation
• Data and Knowledge Sharing Potential:
    Database Mediation
  + Knowledge Representation
  ________________________
  = Model-Based Mediation
• Basic Ideas:
  – turn primary data sources into knowledge sources
  – employ secondary glue knowledge sources
    • generic: UMLS, ...
    • specific: community/laboratory ontologies
Information Integration Landscape
[Figure: systems plotted on two axes]
• horizontal axis: conceptual distance (one-world to multiple-worlds)
• vertical axis: conceptual complexity/depth (low to high)
• regions: DB mediation techniques; Ontologies, KR formalisms; Model-Based Mediation
• systems and scenarios shown: addall, book-buyer, BLAST, EcoCyc, Cyc, WordNet, GO, home-buyer, 24x7 consumer, UMLS, MIA, Entrez, RiboWeb, Tambis, Bioinformatics, Geoinformatics
Knowledge Representation: Relating Theory to the World via Formal Models
John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations
“All models are wrong, but some are useful!”
XML-Based vs. Model-Based Mediation
[Figure: side-by-side comparison of the two mediation styles]
• XML-Based Mediation:
  – Raw Data => XML Elements => XML Models
  – Structural Constraints (DTDs), Parent, Child, Sibling, ...; e.g., A = (B*|C),D and B = ...
  – No Domain Constraints
  – Integrated-DTD := XML-QL(Src1-DTD, ...)
• Model-Based Mediation:
  – Raw Data => (XML) Objects => Conceptual Models
  – Classes, Relations, is-a, has-a, ... (e.g., classes C1, C2, C3 and relation R)
  – Logical Domain Constraints (IF ... THEN ...)
  – Glue Maps: DMs, PMs
  – Integrated-CM := CM-QL(Src1-CM, ...)
CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
CM-QL ~ {F-Logic, DAML+OIL, …}
What’s the Glue? What’s in a Link?
• Syntactic Joins
  – (X,Y) := X.SSN = Y.SSN   [equality]
  – (X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins
  – (X,Y,Score) := BLAST(X,Y,Score)   [similarity]
• Semantic/Rule-Based Joins
  – (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8   [homology, lub]
  – (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y.   [rule-based]
    e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
• Challenge:
  – compile semantic joins into efficient syntactic ones
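As a small illustration of compiling a rule-based join into a syntactic one, the rule (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y can be evaluated as an ordinary hash join on the shared attribute B. The facts below are placeholders for illustration, not curated biological data.

```python
from collections import defaultdict

# Placeholder facts for the two relations used by the rule
produces = [("secretase", "beta amyloid"), ("enzyme_q", "compound_z")]
increased_in = [("beta amyloid", "Alzheimer's disease")]

# Hash index on the join attribute B turns the rule into an equality join
by_b = defaultdict(list)
for b, y in increased_in:
    by_b[b].append(y)

# Each answer carries the derivation path, i.e., "what's in the link"
links = [
    (x, y, ["produces", b, "increased_in"])
    for x, b in produces
    for y in by_b.get(b, [])
]
print(links)
```

Keeping the path in the answer lets the mediator explain why X and Y are linked, not merely that they are.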
Model-Based Mediation Methodology ...
• Lift Sources to export CMs:
    CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
  – complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
  – explicit representation of (“hidden”) source semantics
  – logic rules over OM(S)
• Contextualization CON(S):
  – situate OM(S) data using “glue maps” (GMs):
    • domain maps DMs (ontology) = terminological knowledge: concepts + roles
    • process maps PMs = “procedural knowledge”: states + transitions
... Model-Based Mediation Methodology
• Integrated View Definition (IVD)
  – declarative (logic) rules with object-oriented features
  – defined over CM(S), domain maps, process maps
  – needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
  – mediator composes the user query Q with the IVD
  – ... rewrites (Q o IVD), sends subqueries to sources
  – ... post-processes returned results (e.g., situate in context)
Model-Based Mediator Architecture
[Figure: model-based mediator architecture, top to bottom]
• USER/Client: CM Queries & Results (exchanged in XML)
• Mediator Engine: CM (Integrated View), defined by the Integrated View Definition IVD; components: FL rule proc., LP rule proc., Graph proc., XSB Engine
• “Glue” Maps (GMs): Domain Maps (DMs) and Process Maps (PMs), providing the semantic context CON(S)
• CM-Wrappers: one per source, each on top of an (XML-Wrapper), exporting GCM representations CM S1, CM S2, CM S3, where CM(S) = OM(S) + KB(S) + CON(S)
• Sources: S1, S2, S3
First results & Demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] (w/ Gupta, Martone)
Domain Maps (Ontologies) as Glue Knowledge Sources
• Domain Map = Ontology
  – representation of terminological knowledge
• Use in Model-Based Mediation
  – (derived) concepts as “drop points”, “anchor points”, “context” for source classes
  – compile-time use: view definition, subsumption, classification, ...
  – runtime use: querying/deduction, path queries, ...
• Formalisms:
  – Semantic nets, Thesauri, Frame-logic, Description logics, ...
Ontologies
• So what is an Ontology?
  – definition of things that are relevant to your application
  – representation of terminological knowledge (“TBox”)
  – explicit specification of a conceptualization
  – concept hierarchy (“is-a”)
  – further semantic relationships between concepts
  – abstractions of relational schemas, (E)ER, UML classes, XML Schemas
• Examples:
  – NCMIR ANATOM
  – GO (Gene Ontology)
  – UMLS (Unified Medical Language System)
  – CYC
Formalism for Ontologies: Description Logic
• DL definition of “Happy Father” (Example from Ian Horrocks, U Manchester, UK)
Description Logic Statements as Rules
• In first-order logic (rule form):
    happyFather(X) ←
        man(X), child(X,C1), child(X,C2), blue(C1), green(C2),
        not ( child(X,C3), poorunhappyChild(C3) ).
    poorunhappyChild(C) ←
        not rich(C), not happy(C).
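Read as a query over a given database, the two rules can be evaluated directly. The sketch below does this in Python with invented facts, treating “not” as negation as failure under a closed-world assumption. That is an assumption of this illustration; it is not how a description logic reasoner interprets the corresponding concept definition.

```python
# Invented facts; child(X, C) reads "C is a child of X"
man = {"al", "dan"}
child = {("al", "c1"), ("al", "c2"), ("dan", "c4"), ("dan", "c5"), ("dan", "c6")}
blue = {"c1", "c4"}
green = {"c2", "c5"}
rich = {"c1", "c4"}
happy = {"c2", "c5"}

def poor_unhappy_child(c):
    # poorunhappyChild(C) <- not rich(C), not happy(C)
    return c not in rich and c not in happy

def happy_father(x):
    # happyFather(X) <- man(X), child(X,C1), child(X,C2), blue(C1), green(C2),
    #                   not (child(X,C3), poorunhappyChild(C3))
    kids = {c for (p, c) in child if p == x}
    return (
        x in man
        and any(c in blue for c in kids)
        and any(c in green for c in kids)
        and not any(poor_unhappy_child(c) for c in kids)
    )

print([x for x in sorted(man) if happy_father(x)])
```

In the sample data, dan fails the test because his child c6 is neither rich nor happy.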
Description Logics
• Terminological Knowledge (TBox)
  – Concept Definition (naming of concepts):
  – Axiom (constraining of concepts):
  => a mediator’s “glue knowledge source”
• Assertional Knowledge (ABox)
  – the marked neuron in image 27
  => the concrete instances/individuals of the concepts/classes that your sources export
Querying vs. Reasoning
• Querying:
  – given a DB instance I (= logic interpretation), evaluate a query expression φ (e.g., SQL, FO formula, Prolog program, ...)
  – boolean query: check if I |= φ (i.e., if I is a model of φ)
  – (ternary) query: { (X, Y, Z) | I |= φ(X,Y,Z) }
  => check happyFathers in a given database
• Reasoning:
  – check if I |= φ implies I |= ψ for all databases I,
  – i.e., if φ => ψ
  – undecidable for FO, F-logic, etc.
  – Description Logics are decidable fragments
  => concept subsumption, concept hierarchy, classification
  => semantic tableaux, resolution, specialized algorithms
What’s in an Answer? (What’s in a Link? revisited)
• Semantic/Rule-Based Joins
  – (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y.   [rule-based]
    e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
• What is the Erdős number of person P?
  – 3
• Really? Why?
  – authority based: <VIP> said so
  – faith based: don’t know but firmly believe
  – query statement Q = ... derived it from DB I
  – query Q = ... derived it from DB I and KB T using derivation D
  => logic-based systems often “come with explanations” (“computations as proofs”)
Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)
Domain Expert Knowledge:
  Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
[Figures: Domain Map (DM); DM in Description Logic]
Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map ...
=> sources can register new concepts at the mediator ...
Browsing Registered Data with Domain Maps
Process Maps with Abstractions and Elaborations: From Terminological to Procedural Glue
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges:
  – processes in Src1/Src2
• general form of edges:
• related formalisms
Summary: Mediation Scenarios & Techniques

Federated Databases    | XML-Based Mediation   | Model-Based Mediation
One-World              | One-/Multiple-Worlds  | Complex Multiple-Worlds
Common Schema          | Mediated Schema       | Common Glue Maps
SQL, rules             | XML query languages   | DOOD query languages
Schema Transformations | Syntax-Aware Mappings | Semantics-Aware Mappings
Syntactic Joins        | Syntactic Joins       | “Semantic” Joins via Glue Maps
DB expert              | DB expert             | KRDB + domain expert
Semantic (Community) Webs
“Within the next decade, computing technology will transform the Internet into the Interspace, an information infrastructure that supports semantics indexing and concept navigation across widely distributed community repositories.”
Bruce Schatz, IEEE Computer, Jan. 2002
"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
Tim Berners-Lee et al., 2001
Combine Everything: Die eierlegende Wollmilchsau (the “egg-laying wool-milk-sow”, i.e., an all-in-one solution):
• Database Federation/Mediation
  – query rewriting under GAV/LAV
  – w/ binding pattern constraints
  – distributed query processing
• Semantic Mediation
  – semantic integrity constraints, reasoning w/ plans, automated deduction
  – deductive database/logic programming technology, AI “stuff” ...
  – Semantic Web technology
• Scientific Workflow Management
  – more procedural than database mediation (often the scientist is the query planner)
  – deployment using web services
GEON SMART Metadata: Multihierarchical Rock Classification for “Thematic Queries” (GSC)
[Figure: four independent concept hierarchies: Composition, Genesis, Fabric, Texture]
“smart discovery & querying” via multiple, independent concept hierarchies (controlled vocabularies)
• data at different description levels can be found and processed
GEON SMART Metadata:Multihierarchical Rock Classification for “Thematic Queries”
http://klin-pc.sdsc.edu:8080/examples/jsp/geon/composition.jsp
GEON Ontology Demo
• http://klin-pc.sdsc.edu:8080/examples/jsp/geon/old-rock.jsp
• http://klin-pc.sdsc.edu:8080/examples/jsp/geon/rock.jsp
Architecture of Ontology Based Map Integration
(Diagram labels: Global Web Map Server; Ontology Mapping; Web Map Server (×3); Database (×3))
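A rough sketch of the idea behind this architecture, assuming each local map server labels its features with its own vocabulary and an ontology mapping translates those labels into shared concepts that a global server can query. The server names, labels, concepts, and the `global_map_query` function are all hypothetical; no real Web Map Server API is used.

```python
from typing import Dict, List

# Hypothetical ontology mapping: (source, local label) -> shared concept.
ONTOLOGY_MAPPING = {
    ("server_a", "Granite"): "plutonic_rock",
    ("server_a", "Basalt"): "volcanic_rock",
    ("server_b", "granitoid"): "plutonic_rock",
    ("server_c", "lava flow"): "volcanic_rock",
}

# Stand-ins for three local map servers, each backed by its own database
# and using its own local vocabulary.
LOCAL_SERVERS = {
    "server_a": [
        {"region": "A1", "label": "Granite"},
        {"region": "A2", "label": "Basalt"},
    ],
    "server_b": [{"region": "B1", "label": "granitoid"}],
    "server_c": [
        {"region": "C1", "label": "lava flow"},
        {"region": "C2", "label": "unmapped unit"},
    ],
}


def global_map_query(concept: str) -> List[Dict[str, str]]:
    """Collect features from every local server whose local label maps to
    the requested shared concept; unmapped labels are skipped."""
    results = []
    for source, features in LOCAL_SERVERS.items():
        for feature in features:
            if ONTOLOGY_MAPPING.get((source, feature["label"])) == concept:
                results.append(
                    {"source": source, "region": feature["region"], "concept": concept}
                )
    return results


for hit in global_map_query("plutonic_rock"):
    print(hit["source"], hit["region"])
# server_a A1
# server_b B1
```

The local servers stay unchanged in this sketch; only the mapping table has to know how each vocabulary relates to the shared concepts.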
DOE Scientific Data Management Center
• Scientific Workflow Demo
Example: A Scientific Workflow
(Figure labels: Microarray analysis; cDNA cluster; Database search for promoter identification; Promoter sequences (A, B, C); Common promoter alignment; Promoter model; Database search; New candidate target genes (*))
Adapted from Thomas Werner, Biomolecular Engineering, 17: 87-94 (2001)
Conceptual Workflow
(Diagram; box labels in extraction order:)
Compute clusters (min. distance)
Select gene-set (cluster-level)
For each gene
Retrieve Transcription factors
Arrange Transcription factors
For each promoter
Compute Subsequence labels
With all Promoter Models
Compute Joint Promoter Model
Retrieve matching cDNA
Retrieve genomic Sequence
Extract promoter Region (begin, end)
Create consensus sequence
Align promoters
Mapping This Workflow To Web Sites
Customized CGI Application
ClustalW Output; Transfac Query Results
SDM-SciDAC System Architecture
(Architecture diagram; recoverable labels:)
• User
• WF-Pilot: design (AWF), execution monitoring (EWF)
• WF-Compiler: AWF to EWF translation
• WF-Engine: scheduling and execution
• Abstract Task (AT) Repository
• Executable Task (ET) Repository (ETs shown: Genbank, BLAST)
• Datatype & Conversion Repository
• Data & Parameter Ontologies
• Other labels: AAV rules, ET schemas, conversion rules, query rewriting, web service matching, web service invocation, semantic type checking, data type conversion
AWF to EWF
GetGenomicSequence(+{selectedGene}, -{{GenomicSequence}}) :-
    GENBANK(+{selectedGene}, -{cDNASequence}),
    BLAST(+{cDNASequence}, +dbName, +format, -{rankedGenomicSequenceList}).
GetGenomicSequence(+{selectedGene}, -{{GenomicSequence}}) :-
    GENBANK(+{selectedGene}, -{cDNASequence}),
    BLAT(+{cDNASequence}, +QueryType, +SortCriteria, +OutputType, -{rankedGenomicSequenceList}).
IdentifyPromoterElements(+{rankedGenomicSequenceList}, -{element}) :-
    PromoterSequences(+{rankedGenomicSequenceList}, getBeginEnd(+Species, -Begin, -End), -{element}).
Conceptual workflow fragment:
For each gene
– Retrieve matching cDNA
– Retrieve genomic Sequence
– Extract promoter Region (begin, end)
Slide annotations:
• Declarative specification
• Same functionality, different operational constraints and availability
• Need extra domain knowledge
• User supplied
• Translation to EWF needs creation of iterators
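The two GetGenomicSequence rules give the same abstract task two alternative bodies (GENBANK followed by BLAST, or GENBANK followed by BLAT) with different required inputs. The sketch below shows one way such a choice could be made, assuming a rule body is usable when every service in it is available and every "+" input is already bound. The Python encoding, `feasible`, and `choose_plan` are hypothetical and do not describe the actual WF-Compiler.

```python
from typing import Dict, List, Optional, Set, Tuple

# A step is (service, required "+" inputs, produced "-" outputs).
Step = Tuple[str, List[str], List[str]]

# The two alternative rule bodies from the slide, encoded as step lists.
PLANS: Dict[str, List[List[Step]]] = {
    "GetGenomicSequence": [
        [
            ("GENBANK", ["selectedGene"], ["cDNASequence"]),
            ("BLAST", ["cDNASequence", "dbName", "format"],
             ["rankedGenomicSequenceList"]),
        ],
        [
            ("GENBANK", ["selectedGene"], ["cDNASequence"]),
            ("BLAT", ["cDNASequence", "QueryType", "SortCriteria", "OutputType"],
             ["rankedGenomicSequenceList"]),
        ],
    ]
}


def feasible(plan: List[Step], bound: Set[str], available: Set[str]) -> bool:
    """A plan is feasible if, executed left to right, every service is
    available and all of its '+' inputs are bound when it is called."""
    known = set(bound)
    for service, inputs, outputs in plan:
        if service not in available or not set(inputs) <= known:
            return False
        known |= set(outputs)  # outputs become bound for later steps
    return True


def choose_plan(task: str, bound: Set[str], available: Set[str]) -> Optional[List[str]]:
    """Return the service sequence of the first feasible plan, or None."""
    for plan in PLANS[task]:
        if feasible(plan, bound, available):
            return [service for service, _, _ in plan]
    return None


# BLAST is unavailable here, so the BLAT alternative is selected.
print(choose_plan(
    "GetGenomicSequence",
    bound={"selectedGene", "QueryType", "SortCriteria", "OutputType"},
    available={"GENBANK", "BLAT"},
))  # ['GENBANK', 'BLAT']
```

The set-valued arguments written with braces in the rules would additionally require iterators in an executable workflow; this sketch covers only plan selection.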
Abstract Task (AT) Registration
Abstract Task (AT) View and Delete
Abstract Task (AT) Update
AWF Design
EWF Planning and Compilation
EWF Execution
Some References (starting points)
• XML
– General: http://xml.coverpages.org/xml.html
– XQuery: http://www.w3.org/XML/Query
– XSLT: http://xml.coverpages.org/xsl.html
• Query Rewriting
– database research literature
• Logic Programming
– Learn Prolog Now! http://www.coli.uni-sb.de/~kris/learn-prolog-now/
– SWI-Prolog (nice free Prolog system): http://www.swi-prolog.org/
• Ontologies
– Web Ontology Language (OWL): http://www.w3.org/TR/owl-features/
– http://www-ksl.stanford.edu/kst/what-is-an-ontology.html
– http://www.cs.utexas.edu/users/mfkb/related.html
• Model-Based Mediation
– http://www.sdsc.edu/~ludaesch/Paper/icde01.html
• Semantic Web
– http://www.w3.org/2001/sw/
References: Project Web Sites
• GEOsciences Network (NSF) – www.geongrid.org
• Biomedical Informatics Research Network (NIH) – www.nbirn.net
• Science Environment for Ecological Knowledge (NSF) – seek.ecoinformatics.org
• Scientific Data Management Center (DOE) – sdm.lbl.gov/sdmcenter/