From Database Federation to Model-Based Mediation: Databases Meets* Knowledge Representation
Bertram Ludäscher [email protected]
Data and Knowledge Systems
San Diego Supercomputer Center
U.C. San Diego
* or rather rediscovers
Outline
• Information Integration from a database perspective
  – examples, mediator approach, some technical challenges
• Part I: XML-Based Mediation
  – based on querying semistructured data & XML
  – navigation-driven query evaluation
  – ongoing/future research: querying XML streams
• Part II: Model-Based Mediation
  – basic ideas & architecture, lifting data to knowledge sources
  – “glue maps” (domain maps, process maps)
  – ongoing/future research: mix of DB & KR techniques
• Summary
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
[Diagram: “One-World” Mediation; labels: Information Integration, addall.com, amazon.com, A1books.com, half.com, barnes&noble.com, WWW, public library]
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood
with below-average crime rate and diverse population?
[Diagram: “Multiple-Worlds” Mediation; labels: Information Integration, Realtor, Demographics, School Rankings, Crime Stats]
Information Integration from a DB Perspective
• Information Integration Challenge
  – Given: data sources S_1, ..., S_k (DBMS, web sites, ...) and user questions Q_1, ..., Q_n that can be answered using the S_i
  – Find: the answers to Q_1, ..., Q_n
• The Database Perspective:
  – source = “database”: S_i has a schema (relational, XML, OO, ...) and S_i can be queried
  – define virtual (or materialized) integrated views V over S_1, ..., S_k using database query languages
  – questions become queries Q_i against V(S_1, ..., S_k)
• Why a Database Perspective?
  – scalability, efficiency, reusability (declarative queries), ...
PART I: XML-Based Mediation
Abstract XML-Based Mediator Architecture
[Diagram: each source S_1, S_2, ..., S_k sits behind a Wrapper that exports an XML View. The MEDIATOR holds the Integrated View Definition IVD(S1,...,Sn) and the Integrated XML View V, and exchanges XML Queries & Results with the USER/Client: Query Q o V (S_1,...,S_k).]
A Concrete (Future) XML-Based Mediator System
[Diagram: sources S1, S2, S3 behind XML-Wrappers (labels: XQuery, XPath, XSLT, XSQL, XScan, SQL, http-get); MEDIATOR Engine with XQuery Processor and Integrated View Definition IVD; XML (Integrated View); XML Queries & Results exchanged with the USER/Client.]
First Results & Demos: XMAS language and algebra, VXD evaluation, BBQ UI [WebDB99] [SSD99] [SIGMOD99] [EDBT00] (w/ Papakonstantinou, Vianu, ...)
Some Technical Challenges ...
• XML Query Languages
  – DB community: QLs for semistructured data, e.g., TSIMMIS/MSL, Lorel, Yatl, ..., Florid/F-logic [InfSystems98]
  – CSE/SDSC: XMAS [SSD99,WebDB99,EDBT00]
  – W3C: XPath, XSLT, XQuery (Working Draft, June 2001)
• DB Theory: Expressiveness/Complexity Trade-Off
  – querying: FO, (WF/S-)Datalog, FO(LFP), FO(PFP), ..., all
  – reasoning: query satisfiability, containment, equivalence
... Some More Technical Challenges ...
• DB Practice: Query Composition
  – compute Q o V(S_1,...,S_k) w/o computing all of V: “push Q through V into S_i”
  – in Datalog: view unfolding (resolution, unification) + simplification ~ top-down evaluation ~ magic sets
  – in XML: some solutions (Papakonstantinou, ...)
• Navigation-Driven Evaluation of Integrated View V:
  – V materialized => warehousing approach
  – V virtual => mediator approach
  – V virtual & driven by user-navigation => VXD approach [EDBT00] (w/ Papakonstantinou, Velikhov)
XMAS: XML Matching And Structuring language
Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”

    CONSTRUCT <books>
                <book>
                  $a1
                  $t
                  <pubs>
                    $p { $p }
                  </pubs>
                </book> { $a1, $t }
              </books>
    WHERE <books.book>
            $a1 : <author />
            $t : <title />
          </> IN "amazon.com"
      AND <authors.author>
            $a2 : <author />
            <pubs> $p : <pub/> </>
          </> IN "www...DBLP… "
      AND value( $a1 ) = value( $a2 )

[Figure labels: XMAS; XMAS Algebra]
XML (XMAS) Query Processing
[Diagram: query processing pipeline]
Compile-time:
  XMAS Query Q, XMAS View Definition V
    -> Translator -> algebraic plans
    -> Composition (Q o V) -> composed plan
    -> Rewriter/Optimizer -> optimized plan
Run-time:
    -> Plan Execution: lazy VXD evaluation
Navigation-Driven Evaluation: Lazy Mediators
[Diagram: a Lazy Mediator sits between the client's result and the XML sources S_1, ..., S_k, and holds the view definition ans = V( S_1 … S_k ).]
Input: client navigations
Output: source navigations
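To make the idea concrete: a rough Python sketch of navigation-driven evaluation, using generators as a stand-in for the lazy mediator. This is not the VXD algorithm from [EDBT00]; the class `CountingSource`, the function `lazy_view`, the price filter, and all book data are hypothetical and only illustrate that client navigations into the result trigger just as many source navigations as needed.

```python
from itertools import islice


class CountingSource:
    """Hypothetical wrapped source that counts the navigations issued to it."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows
        self.navigations = 0

    def __iter__(self):
        for row in self.rows:
            self.navigations += 1  # one source navigation per row fetched
            yield row


def lazy_view(*sources):
    """Virtual view ans = V(S_1 ... S_k): books under 20 from all sources.

    Nothing is fetched until the client asks for the next answer.
    """
    for source in sources:
        for title, price in source:
            if price < 20:
                yield (source.name, title, price)


# Hypothetical data; titles and prices are made up.
amazon = CountingSource("amazon.com", [("A", 25), ("B", 15), ("C", 10), ("D", 5)])
half = CountingSource("half.com", [("E", 12)])

answer = lazy_view(amazon, half)
first_two = list(islice(answer, 2))  # the client navigates to two results only
```

After the client has looked at two answers, `amazon.navigations` is 3 and `half.navigations` is 0: the rest of the view was never computed.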
Open Issue: Querying XML Streams
• Given:
  – stream S of XML events (open, close, data)
  – XML query Q over S
  – constraints: 1-pass “on-the-fly” processing, bounded memory
• Find:
  – decide whether, and if so how, Q can be evaluated given the constraints
• Initial Approach:
  – transducer model XSM (XML Stream Machine) to approximate “streamable” queries (w/ Papakonstantinou, Mukhopadhyay, Vianu)
Example: XML Stream Query
XML query (r) = for each customer $C, list all orders $O
Query-aware DTD design is even more important for stream queries!
Example: XML Stream Machine (XSM)
transducer model
input/output: stream of XML events
memory: finite state control, buffers
transitions: on EVENT do ACTION
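A toy illustration of the "on EVENT do ACTION" style, written in Python. It is only a sketch under simplifying assumptions and not the XSM model itself: events are hypothetical `(kind, value)` pairs, the query copies every `order` element to the output, `order` elements are assumed not to nest, and no buffers are needed for this particular query.

```python
def select_orders(events):
    """One-pass transducer: copy each <order> element to the output stream.

    Memory is a single finite state ("OUTSIDE" or "INSIDE"), so the query
    runs on the fly with bounded memory.
    """
    state = "OUTSIDE"
    for kind, value in events:  # on EVENT ...
        if state == "OUTSIDE":
            if kind == "open" and value == "order":
                state = "INSIDE"
                yield (kind, value)  # ... do ACTION: emit the event
        else:
            yield (kind, value)
            if kind == "close" and value == "order":
                state = "OUTSIDE"


# Hypothetical input stream for one customer with one order.
stream = [
    ("open", "customer"),
    ("open", "name"), ("data", "Ann"), ("close", "name"),
    ("open", "order"), ("data", "o1"), ("close", "order"),
    ("close", "customer"),
]

orders = list(select_orders(stream))
```

A query that must output something seen later before something seen earlier would need buffers, and possibly unbounded ones, which is where the question of "streamable" queries arises.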
PART II: Model-Based Mediation
A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry?
How does it relate to host rock structures?
[Diagram: “Complex Multiple-Worlds” Mediation; labels: Information Integration, Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)]
A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity?
How about other rodents?
[Diagram: “Complex Multiple-Worlds” Mediation; labels: Information Integration, protein localization (NCMIR), neurotransmission (SENSELAB), sequence info (CaPROT), morphometry (SYNAPSE)]
What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
  – canonical syntax for labeled ordered trees
  – a metalanguage, but all semantics lies outside of XML
    • DTDs => tags + nesting, XML Schema => DTDs + data modeling
    • need anything else? => write comments!
• Domain Semantics is complex:
  – implicit assumptions, hidden semantics
  – sources seem unrelated to the non-expert
• Need Structure and Semantics beyond XML trees!
  – employ richer OO models
  – make domain semantics and “glue knowledge” explicit
  – use ontologies to fix terminology and conceptualization
  – avoid ambiguities by using formal semantics
Information Integration Landscape
[Diagram: systems plotted by conceptual distance (one-world to multiple-worlds) against conceptual complexity/depth (low to high). Region labels: DB mediation techniques, Ontologies/KR formalisms, Model-Based Mediation. Plotted labels: addall, book-buyer, home-buyer, 24x7 consumer, BLAST, EcoCyc, Cyc, WordNet, GO, UMLS, MIA, Entrez, RiboWeb, Tambis, Bioinformatics, Geoinformatics.]
XML-Based vs. Model-Based Mediation
[Diagram: two stacks built on the same Raw Data]
XML-Based Mediation:
  – XML Models: XML Elements
  – Structural Constraints (DTDs), Parent, Child, Sibling, ...: A = (B*|C),DB = ...
  – No Domain Constraints
  – Integrated-DTD := XML-QL(Src1-DTD,...)
Model-Based Mediation:
  – Conceptual Models: (XML) Objects
  – Classes, Relations, is-a, has-a, ... (C1, C2, C3, R)
  – Logical Domain Constraints (IF ... THEN ...)
  – Glue Maps: DMs, PMs
  – Integrated-CM := CM-QL(Src1-CM,...)
CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
CM-QL ~ {F-Logic, DAML+OIL, …}
What’s the Glue? What’s in a Link?
• Syntactic Joins (equality)
  – (X,Y) := X.SSN = Y.SSN
  – (X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins (similarity)
  – (X,Y,Score) := BLAST(X,Y,Score)
• Semantic/Rule-Based Joins
  – (X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8   (homology, lub)
  – (X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y.   (rule-based)
    e.g., X=-secretase, B=beta amyloid, Y=Alzheimer’s disease
• YAC (Yet Another Challenge):
  – compile semantic joins into efficient syntactic ones
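The difference between the join kinds can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `isa` facts, the `similarity` table (a stand-in for real BLAST scores, which this sketch does not compute), and the function names.

```python
# Hypothetical glue knowledge: concept membership and pairwise similarity.
isa = {
    "p1": "calcium_binding_protein",
    "p2": "calcium_binding_protein",
    "p3": "kinase",
}
similarity = {("p1", "p2"): 0.9, ("p1", "p3"): 0.95}  # stand-in for BLAST(X,Y,S)


def syntactic_join(xs, ys, key_x, key_y):
    """(X,Y) := X.key_x = Y.key_y, plain equality on attribute values."""
    return [(x, y) for x in xs for y in ys if x[key_x] == y[key_y]]


def semantic_join(xs, ys, threshold=0.8):
    """(X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S > threshold."""
    result = []
    for x in xs:
        for y in ys:
            score = similarity.get((x, y), 0.0)
            if isa[x] == isa[y] and score > threshold:
                result.append((x, y, isa[x]))
    return result


people = [{"SSN": 1, "name": "a"}]
homes = [{"SSN": 1, "city": "x"}, {"SSN": 2, "city": "y"}]
by_ssn = syntactic_join(people, homes, "SSN", "SSN")

homologs = semantic_join(["p1"], ["p2", "p3"])
```

Note that `p3` scores higher than `p2` against `p1` but is dropped, because the two are not instances of a common concept: the glue knowledge, not the score alone, decides the join.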
Model-Based Mediation Methodology ...
• Lift Sources to export CMs:
    CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
  – complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
  – explicit representation of (“hidden”) source semantics
  – logic rules over OM(S)
• Contextualization CON(S):
  – situate OM(S) data using “glue maps” (GMs):
    • domain maps DMs (ontology) = terminological knowledge: concepts + roles
    • process maps PMs = “procedural knowledge”: states + transitions
... Model-Based Mediation Methodology
• Integrated View Definition (IVD)
  – declarative (logic) rules with object-oriented features
  – defined over CM(S), domain maps, process maps
  – needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
  – mediator composes the user query Q with the IVD
  – ... rewrites (Q o IVD), sends subqueries to sources
  – ... post-processes returned results (e.g., situate in context)
Model-Based Mediator Architecture
[Diagram: sources S1, S2, S3, each behind an (XML-Wrapper) and a CM-Wrapper exporting CM(S) = OM(S)+KB(S)+CON(S) (labels: GCM, CM S1, CM S2, CM S3). Mediator Engine: FL rule proc., LP rule proc., Graph proc., XSB Engine. Further labels: Integrated View Definition IVD; “Glue” Maps GMs (Domain Maps DMs, Process Maps PMs); semantic context CON(S); CM (Integrated View); CM Queries & Results (exchanged in XML); USER/Client.]
First results & Demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] (w/ Gupta, Martone)
Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)
Domain Expert Knowledge:
Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
[Figures: Domain Map (DM); DM in Description Logic]
Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...
sources can register new concepts at the mediator ...
Example: ANATOM Domain Map
Browsing Registered Data with Domain Maps
Compilation : Domain Maps => F-Logic Rules
• Domain Maps ~ Ontologies
• DMs have a formal semantics via a translation to F-Logic (~ Datalog + OO features)
  => Declarative + “Executable” Specification
• query evaluation with deductive rules
• reasoning over decidable fragments:
  – checking concept subsumption, equivalence
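What "executable" can mean here is sketched below in Python rather than F-logic. A few edges paraphrased from the domain expert knowledge quoted earlier are stored as `isa` and `has` facts, and two Datalog-style transitivity rules are evaluated by a naive bottom-up fixpoint. The encoding, the concept names, and the helper `has_component_of_kind` are assumptions for illustration, not the translation used in the talk.

```python
def transitive_closure(edges):
    """Naive fixpoint for the rule r*(X,Z) :- r*(X,Y), r*(Y,Z)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Domain map edges (hypothetical encoding of the expert statements).
isa = {("spine", "ion_regulating_component")}
has = {
    ("purkinje_cell", "dendrite"),
    ("pyramidal_cell", "dendrite"),
    ("dendrite", "branch"),
    ("branch", "spine"),
}

isa_star = transitive_closure(isa)
has_star = transitive_closure(has)


def has_component_of_kind(x, concept):
    """Derived fact: x has (transitively) some part y with y isa concept."""
    parts = {y for (whole, y) in has_star if whole == x}
    return any(y == concept or (y, concept) in isa_star for y in parts)


purkinje_regulates_ions = has_component_of_kind(
    "purkinje_cell", "ion_regulating_component"
)
```

The same closure over `isa` edges answers simple subsumption questions; a description logic reasoner would be needed for anything beyond such graph reachability.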
Query Processing “Demo”
[Screenshot: query results in context; contextualization CON(Result) wrt. ANATOM.]
Integrated View Definition

    DERIVE protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
    IF I:protein_label_image[ proteins ->> {Protein}; organism -> Organism;
           anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}],   % from PROLAB
       NAE:neuro_anatomic_entity[name->Anatom; located_in->>{Brain_region}],     % from ANATOM
       AS..segments..features[name->Feature_name; value->Value].

• provided by the domain expert and mediation engineer
• deductive OO language (here: F-logic)
Example: Inside Query Evaluation
“How does the parallel fiber output (Yale/SENSELAB) relate to the distribution of Ryanodine Receptors (UCSD/NCMIR)?”
push selection
  @SENSELAB: X1 := select targets of “output from parallel fiber”;
determine source context
  @MEDIATOR: X2 := “find and situate” X1 in ANATOM Domain Map;
compute region of interest (here: downward closure)
  @MEDIATOR: X3 := subregion-closure(X2);
push selection
  @NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors);
compute protein distribution
  @MEDIATOR: X5 := compute aggregate(X4);
display in context
  @MEDIATOR/GUI: display X5 in context (ANATOM)
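The plan X1 to X5 can be mimicked in Python to show the data flow between sources and mediator. This is a sketch only: the three dictionaries standing in for SENSELAB, the ANATOM domain map, and NCMIR contain made-up values, the aggregate is assumed to be a sum per region, and the display step is omitted.

```python
from collections import defaultdict

# All data below is hypothetical and only illustrates the shape of the plan.
SENSELAB_TARGETS = {"output from parallel fiber": ["purkinje_cell"]}
ANATOM_LOCATED_IN = {"purkinje_cell": "cerebellar_cortex"}
ANATOM_SUBREGIONS = {
    "cerebellar_cortex": ["molecular_layer", "purkinje_layer"],
    "molecular_layer": [],
    "purkinje_layer": [],
}
NCMIR_PROT_DATA = [  # (region, protein, measured value)
    ("molecular_layer", "Ryanodine Receptor", 3.0),
    ("purkinje_layer", "Ryanodine Receptor", 5.0),
    ("hippocampus", "Ryanodine Receptor", 7.0),
    ("purkinje_layer", "Calbindin", 2.0),
]


def subregion_closure(regions):
    """Downward closure: the given regions plus all of their subregions."""
    seen, stack = set(), list(regions)
    while stack:
        region = stack.pop()
        if region not in seen:
            seen.add(region)
            stack.extend(ANATOM_SUBREGIONS.get(region, []))
    return seen


def evaluate(neurotransmission, protein):
    x1 = SENSELAB_TARGETS[neurotransmission]        # @SENSELAB: push selection
    x2 = {ANATOM_LOCATED_IN[t] for t in x1}         # @MEDIATOR: situate in ANATOM
    x3 = subregion_closure(x2)                      # @MEDIATOR: region of interest
    x4 = [(r, v) for (r, p, v) in NCMIR_PROT_DATA   # @NCMIR: push selection
          if r in x3 and p == protein]
    x5 = defaultdict(float)                         # @MEDIATOR: aggregate
    for region, value in x4:
        x5[region] += value
    return dict(x5)


distribution = evaluate("output from parallel fiber", "Ryanodine Receptor")
```

Only the regions in X3 are sent to the second source, so the hippocampus row is never part of the answer even though it matches the protein.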
Some Open Database & Knowledge Representation Issues
• Mix of Query Processing and Reasoning
  – FaCT description logic reasoner for DMs?
  – or reconciliation of DMs via argumentation frameworks (“games”) using well-founded and stable models of logic programs [ICDT97,PODS97,TCS00]
• Modeling “Process Knowledge” => Process Maps
  – formal semantics? (dynamic/temporal/Kripke models?)
  – executable semantics? (Statelog?)
• Graph Queries over DMs and PMs
  – expressible in F-logic [InfSystems98]
  – scalability? (UMLS Domain Map has millions of entries)
• ...
Towards Process Maps with Abstractions and Elaborations
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges:
  – processes in Src1/Src2
• general form of edges: [figure]
Summary: Mediation Scenarios & Techniques
Federated Databases    | XML-Based Mediation    | Model-Based Mediation
One-World              | One-/Multiple-Worlds   | Complex Multiple-Worlds
Common Schema          | Mediated Schema        | Common Glue Maps
SQL, rules             | XML query languages    | DOOD query languages
Schema Transformations | Syntax-Aware Mappings  | Semantics-Aware Mappings
Syntactic Joins        | Syntactic Joins        | “Semantic” Joins via Glue Maps
DB expert              | DB expert              | KRDB + domain expert
Questions?
Queries?
Some References
• XML-Based and Model-Based Mediation:
  – MBM: Model-Based Mediation with Domain Maps, B. Ludäscher, A. Gupta, M. E. Martone, 17th Intl. Conference on Data Engineering (ICDE), Heidelberg, Germany, IEEE Computer Society, 2001.
  – VXD/Lazy Mediators: Navigation-Driven Evaluation of Virtual Mediated Views, B. Ludäscher, Y. Papakonstantinou, P. Velikhov, Intl. Conference on Extending Database Technology (EDBT), Konstanz, Germany, LNCS 1777, Springer, 2000.
  – DOOD: Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective, B. Ludäscher, R. Himmeröder, G. Lausen, W. May, C. Schlepphorst, Information Systems, 23(8), Special Issue on Semistructured Data, 1998.
• STATELOG (Logic Programming with States)
  – On Active Deductive Databases: The Statelog Approach, G. Lausen, B. Ludäscher, and W. May. In Transactions and Change in Logic Databases, Hendrik Decker, Burkhard Freitag, Michael Kifer, and Andrei Voronkov, editors. LNCS 1472, Springer, 1998.
• Argumentation Frameworks as Games
  – Games and Total DatalogNeg Queries, J. Flum, M. Kubierschky, B. Ludäscher, Theoretical Computer Science, 239(2), pp. 257-276, Elsevier, 2000.
  – Referential Actions as Logical Rules, B. Ludäscher, W. May, G. Lausen, Proc. 16th ACM Symposium on Principles of Database Systems (PODS'97), Tucson, Arizona, ACM Press, 1997.