![Page 1: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/1.jpg)
1
Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems
Chen Li
Information and Computer ScienceUniversity of California, Irvine
![Page 2: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/2.jpg)
2
Constraints in Data Integration
Sources of Orange County (OC) housing information
house(street,zip,price,sqft,year)
Mediator
s1(street, zip, price, sqft) s2(street, price, year)
Constraints: — s1 price at least $250K— s2 price at least $280K
Query: — “Find Orange County houses cheaper than $200K”— Answer: empty, without checking the source table instances
![Page 3: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/3.jpg)
3
Another Example
house(street,zip,price,sqft,year)
Mediator
s1(street, zip, price, sqft) s2(street, price, year)
Constraint: OC houses have a unique street Query:
— “Find sqft and year of OC houses”— Plan: join two source relations on the “street”
attributes— The plan is invalid if the constraint is not true.
![Page 4: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/4.jpg)
4
Importance of Constraints They express a rich amount of
information about sources They can be utilized to help query
answering
![Page 5: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/5.jpg)
5
Our contributions There has been a lot of work on how to use
constraints in query optimization, e.g.:— [Hsu, Knoblock, 2000]— [Godfrey, Gryz, Zuzarte, 2001]— …
To make this optimization possible, we primarily study:
— how to describe constraints; and — how to manipulate constraints between “local” and
“global”
![Page 6: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/6.jpg)
6
Outline Constraints motivation Local constraints Global constraints Conclusions and open problems
![Page 7: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/7.jpg)
7
Describing various constraints
Examples:— C1: s1.price >= $250K— C2: s1.street is unique— C3: s2.year < 1995, s2.price >= $280K— C4: street is unique for all OC houses— C5: all houses in the system are at least $250K
Where to put them?— Some constraints can be described at sources “locally”
(C1,C2,C3)— Others are more suitable to be described “globally” (C4,C5)
house(street,zip,price,year,sqft)
s1(street, zip, price, sqft) s2(street, price, year)
![Page 8: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/8.jpg)
8
Local constraints: described at sources
Designed by individual sources Common local constraints:
— Range constraints “price >= $250K”
— Enumeration constraints “state in {CA, NV, AZ, OR}”
— Functional dependencies “(street, zip) (price, year)” Key constraints
• “(street) is a primary key of house predicate”— Foreign-key constraints among tables at a source— Inclusions— …
![Page 9: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/9.jpg)
9
Other ways to describe local constraints
Could be described using traditional source-view definitions
E.g., in the LAV (local-as-view) approach to data integration:“C1: s1.price >= $250K” can be described ass1(street, zip, price, sqft) :- house(street, zip, price, sqft, year),price >=
250K
![Page 10: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/10.jpg)
10
Then why not use view/query languages to describe local constraints?
View/query languages might not be expressive enough
— E.g., functional-dependency constraints Query answering could become complicated
— “Conjunctive-query rewriting” is already NP— Arithmetic comparisons even require recursive
queries Describing constraints separately have
advantages:— More expressive: can capture common knowledge— We care more about those simple, common
constraints— Easier to understand and do reasoning
![Page 11: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/11.jpg)
11
Limitations of local constraints
— C4: street is unique for all OC houses Cannot describe C4 using two local constraints:
— “street is a key at s1” & “street is a key at s1”— Wrong! Since they cannot restrict two relations
together!— In particular, the following is still allowed
S1: (main, 92697, $300K, 2100) S2: (main, $380K, 1993)
house(street,zip,price,year,sqft)
s1(street, zip, price, sqft) s2(street, price, year)
![Page 12: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/12.jpg)
12
Global constraints (GC) Some constraints are more
suitable to be described “globally” Example:
— C4: street is unique for all OC houses— C5: all houses in the system are at least
$250K
![Page 13: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/13.jpg)
13
Describing global constraints
Might need more expressive languages:— “street” is unique for houses at s1 and s2— Need to formally define such a “global key”
Consider special cases:— We have source schemas and mediator schema— Mappings exist between them: e.g., LAV, GAV,
GLAV, or more general mapping language— In this case, global constraints could be easier
to describe
![Page 14: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/14.jpg)
14
Other advantages of GC They “summarize” source contents Mediators can check queries against these
conditions, before checking individual sources (thus could avoid unnecessary source checking)
Users get an overview of the data (easier to ask queries)
Give “outside world” a view of the source contents
— Especially useful when the mediation system is used as a component of a larger system
— E.g., peer-based mediation systems (Raccoon, Piazza, Hyperion, PeerDB, …) or hierarchies of mediators
![Page 15: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/15.jpg)
15
Two kinds of global constraints
General global constraints Source-derived global constraints
![Page 16: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/16.jpg)
16
General global constraints Conditions that should be satisfied by any database
instance of a global predicate: — e.g., C4: street is unique for all OC houses— It can be represented as the general global constraint on
the house predicate (street) forms a key of house predicate
Introduced during the system design to capture the semantics of the application domain
Future new sources expected to satisfy this constraint
Thus: may need to check if it is satisfied by existing and new-coming sources
![Page 17: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/17.jpg)
17
Source-derived global constraints (Example)
Local constraints:— C1: s1.price >= $250K— C3: s2.year < 1995, s2.price >= $200K Global constraint C5: house.price >= $200K
C5 is true only when the system has the two sources
In general, there could be a house (not at s1 and s2) that is less than $250K
— We don’t care about these houses, pretending they didn’t exist
When future sources come in, we need to check and update this constraint
![Page 18: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/18.jpg)
18
Source-derived global constraints
It is a condition on global predicates It must be satisfied by any derived database
D of any source view instances satisfying those local constraints.
Derived database D: certain tuples that can be decided based on the view definitions
— Depends on the view definition (LAV, GAV, …)— s1(street, zip, price, sqft) :-
house(street, zip, price, sqft, year), price >= 250K
![Page 19: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/19.jpg)
19
Computing Source-derived GC
Input:— Sources with their local constraints— Mappings between source schema and
the mediator schema Output: source-derived GC on
mediator schema We have preliminary results for the
LAV approach
![Page 20: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/20.jpg)
20
Comparisons
General GC Source-derived GC
When hold? Hold in general Hold for the system with current sources
When new sources join..
Assume new sources satisfy the GC. Could be validated.
New sources could violate the GC. Need to recalculate
![Page 21: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/21.jpg)
21
Conclusions Describing constraints in data-integration system is
important We classified different types of constraints:
— Local constraints— Global constraints:
General Source-derived
We showed their advantages and limitations We studied how to manipulate these constraints
(e.g., compute source-derived global constraints from local constraints)
![Page 22: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/22.jpg)
22
Open problems Expressive languages to describe cross-
source constraints— “tuple-generating dependencies”?— Other simpler but powerful languages
Manipulating the constraints in general— LAV, GAV— more expressive mappings between sources and
mediators Efficient techniques for testing and re-
computing of global constraints as sources change
![Page 23: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/23.jpg)
23
Work conducted in The RACCOON Project on Peer-based Data Integration and sharing,UC Irvine
![Page 24: Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems](https://reader036.vdocuments.net/reader036/viewer/2022070406/56814083550346895dac092d/html5/thumbnails/24.jpg)
24
AcknowledgementsWe thank Jia Li for her help on the
preparation of these slides.