Principles of Dataspace Systems
Alon Halevy
PODS, June 26, 2006
Post on 19-Dec-2015
Outline
• Example data management challenges
– Denote by: “dataspaces” [Franklin, H., Maier]
• Dataspace Support Platforms:
– “Pay-as-you-go” data management
• Putting meat to the bones:
– Specific research problems, recent progress
– Querying, dataspace evolution, reflection
• (Possibly) some predictions and subliminal messages.
Personal Information Management [Semex: Dong et al.]
[Figure: graph of personal information with association labels OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor]
The Web is Getting Semantic
• Forms (millions)
• Vertical search engines (hundreds)
• Annotation schemes: Flickr, ESP Game
• Google Coop
– DB search engine coming soon!
“A little semantics goes a long way”
[See Madhavan talk on Wednesday afternoon]
“Data is the plural of anecdote”
• Digital libraries, enterprises, “smart homes”
• Corie Environmental Observation System
– Talk to Maier
• Circle of Blue
– Data about the world’s water sources
• The Boeing 777
– [Hanrahan @ Stanford]
Requirements
• A system that:
– Is defined by boundaries (organizational, physical, logical), not explicit entry.
– Provides best-effort services
  • Little or no setup time
– Leverages semantics when possible
Manage dataspaces
Other Dataspace Characteristics
• All dataspaces contain >20% porn.
• The rest has >50% spam.
Dataspaces vs. Data Integration
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)
Inventory Database A
Inventory Database B
Data integration requires semantic mappings
Dataspaces vs. Data Integration
Dataspaces are “pay as you go”
[Figure: benefit plotted against investment (time, cost), with one curve for dataspaces and one for data integration solutions. Artist: Mike Franklin]
Dataspaces vs. Data Integration
• Data integration systems require semantic mappings.
– Dataspaces are “pay-as-you-go”
• Dataspaces capture a broader class of semantic relationships:
– DerivedFrom, SnapshotOf, HighlyCorrelatedWith, …
Why Do This Now?
• Fact:
– Data management is about people, not enterprises.
• Prediction:
– In the next few years, DB conferences will be about data management for the masses.
• “CMA”:
– If not, our community will become largely irrelevant.
• Observation:
– We’re doing it anyway: e.g., DB&IR, information extraction, uncertainty, …
Outline
• Example data management challenges
– Denote by: “dataspaces”
• Dataspace Support Platforms:
– “Pay-as-you-go” data management
• Putting meat to the bones:
– Specific research problems, recent progress
– Querying, dataspace evolution, reflection
• (Possibly) some predictions and subliminal messages.
Logical Model: Participants and Relationships
[Figure: participants (XML, WSDL, SDB, sensor, java, RDF, RDB) connected by relationships labeled: schema mapping (manually created), snapshot, view, replica, 1hr updates]
Relationships
• General form:
(Obj1, Rel, Obj2, p)
• Obj1, Obj2: instances or sources,
• p: degree of certainty
• Language for describing relationships? – [Rosati]
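The general form (Obj1, Rel, Obj2, p) can be pictured as a small record type. The following is a minimal Python sketch, not a formalism from the talk: the class name `Relationship`, the field names, the helper `related`, the sample source names, and the choice to keep p in [0, 1] are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One (Obj1, Rel, Obj2, p) tuple; names here are illustrative."""
    obj1: str   # an instance or a source
    rel: str    # relationship name, e.g. "SnapshotOf"
    obj2: str   # an instance or a source
    p: float    # degree of certainty, assumed here to lie in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must be between 0 and 1")


def related(rels, obj, min_p=0.0):
    """Return the relationships that mention obj with certainty >= min_p."""
    return [r for r in rels if obj in (r.obj1, r.obj2) and r.p >= min_p]


# Hypothetical sources and certainties, invented for the example.
rels = [
    Relationship("sourceB", "SnapshotOf", "sourceA", 0.9),
    Relationship("sourceC", "DerivedFrom", "sourceA", 0.4),
    Relationship("sourceC", "HighlyCorrelatedWith", "sourceD", 0.7),
]
confident = related(rels, "sourceA", min_p=0.5)
```

Keeping p on every tuple, rather than on a separate table, lets a query filter or rank by certainty without an extra lookup.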
Dataspace Support Platforms (DSSP)
[Figure: the participants and relationships of the logical model (XML, WSDL, SDB, sensor, java, RDF, RDB; schema mapping, snapshot, view, replica, 1hr updates), surrounded by the DSSP components: Catalog, Search & query, Local Store & Index, Administration, Discover & Enhance]
Outline
• Example data management challenges
– Denote by: “dataspaces”
• Dataspace Support Platforms:
– “Pay-as-you-go” data management
• Putting meat to the bones:
– Specific research problems, recent progress
– Querying, dataspace evolution, reflection
• (Possibly) some predictions and subliminal messages.
Dataspace Queries
• Keyword queries as starting point
– Later may be refined to add structure
– Formulated in terms of user’s “schema”
• Mostly of the form
– Instance*:
  • “britany spears”
– P (instance)
  • “chicago weather”
  • “PC chair PODS”
Query
Evolve
Reflect
Semantics of Answers
1. The actual answers:
– P(instance), P*(instance)
2. Sources where answer can be found:
– Partially specify the query to the source
– Help the user clean the query
3. Supporting facts or sources:
– Facts that can be used to derive P(instance)
– Rest of derivation may be obvious to user
Related or Partial Answers
• In which country was Dan Suciu born?
– Bucharest
• Latest edition of software X:
– 2004 edition
• Is the Space Needle higher than the Eiffel Tower?
– Height of Seattle Space Needle: 184m
– Height of Eiffel Tower: 324m
Rank all types of answers
Query Processing: Data Integration
[Figure: query-processing diagram showing queries Q1, Q2, Q3, Q4, Q41, Q42, Q43 and Weather(Chicago)]
PDMS, Active XML, Data exchange
Query Processing: DSSPs
Query: Jan Van den Bussche address
First name: Jan
Middle name:
Last name: Van den Bussche
Address: ?
Query Processing: DSSPs
First name: Jan
Middle name:
Last name: Van den Bussche
Address: ?
[Figure: candidate sources for the address, with labels: address; address: City required; Keyword query: J. vd Bussche; City?; StreetAdr; city, zip; Companies; address. Candidate answers are annotated (t1,p1), (t2,p2), (t1,p3)]
Two Principles
• Mappings are approximate at best
– What do approximate mappings mean?
– Answering queries with them?
– Composition?
• Answering queries = finding evidence + combining evidence
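To make "finding evidence + combining evidence" concrete, here is a minimal Python sketch in which the same candidate answer arrives from several sources, as with (t1,p1) and (t1,p3) earlier. The talk does not name a combination rule; noisy-or under an independence assumption is swapped in here purely for illustration, and the function name and the numbers are hypothetical.

```python
from collections import defaultdict


def combine_evidence(evidence):
    """Merge (answer, p) pairs with noisy-or: 1 - prod(1 - p_i).

    Assumes the pieces of evidence are independent, which is a
    simplification chosen for this sketch, not a rule from the talk.
    """
    miss = defaultdict(lambda: 1.0)  # probability that every source is wrong
    for answer, p in evidence:
        miss[answer] *= (1.0 - p)
    combined = {answer: 1.0 - m for answer, m in miss.items()}
    # Rank answers from most to least certain.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical evidence: t1 is supported by two sources, t2 by one.
evidence = [("t1", 0.5), ("t2", 0.6), ("t1", 0.5)]
ranked = combine_evidence(evidence)
```

With these made-up numbers, t1 ends up ranked above t2 even though each individual piece of evidence for t1 is weaker, which is the point of combining rather than taking the single best source.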
The Cost of Semantics
[Figure: benefit plotted against investment (time, cost), with one curve for dataspaces and one for data integration solutions, annotated with a “?”. Artist: Mike Franklin]
Semantic integration modeled by:
{(Obj1, rel, Obj2, p),…}
Reusing Human Attention
• Principle:
– User action = statement of semantic relationship
– Leverage actions to infer other semantic relationships
• Examples
– Providing a semantic mapping
  • Infer other mappings
– Writing a query
  • Infer content of sources, relationships between sources
– Creating a “digital workspace”
  • Infer “relatedness” of documents/sources
  • Infer co-reference between objects in the dataspace
– Annotating, cutting & pasting, browsing among docs
Past, Present and Future
• Leverage past actions & existing structure:
– [Dong et al., 2004, 2005], [He & Chang, 2003]
• Generalize from current actions
– Queries, schema mappings
• Beg for extra attention:
– ESP [von Ahn], mass collaboration [Doan+], active learning [Sarawagi et al.]
Reuse: Learning Schema Mappings [Doan et al.]
• Classifiers for mediated schema
• Thousands of web forms mapped in little time
Transformic Inc: deep web search.
[Madhavan et al.]: infer mappings for any schemas in the domain
[Figure labels: Mediated schema; Action]
(S1, M, S, p)
Technical Outline
Query
Evolve
Reflect
• Unify lineage, uncertainty, and inconsistency
• Model them on views
Dataspace Reflection
• Answers are uncertain in dataspaces:
– data,
– mappings,
– query answering techniques
• Data may be inconsistent
• Tracking lineage is crucial
A DSSP Needs to
• Process uncertain data
• Update uncertainty with new evidence
• Be proactive about reducing uncertainty
• Live with inconsistency
• Leverage lineage to reduce uncertainty
(Obj1, Rel, Obj2, p)
Two Principles
• Need a single formalism for modeling:
– uncertainty,
– inconsistency, and
– lineage
• Model them on views
Uncertainty and Inconsistency
• Inconsistency = uncertainty about the truth
– Salary (John Doe, $120,000)
– Salary (John Doe, $135,000)
Salary (John Doe, $120,000 | $135,000)
• Orchestra, Ives @ U. Penn.
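The salary example suggests treating conflicting facts as one uncertain fact with several alternatives. A minimal Python sketch of that idea follows; the function name `merge_conflicts` and the (relation, key, value) encoding are assumptions made for illustration, not a representation taken from the talk.

```python
from collections import defaultdict


def merge_conflicts(facts):
    """Group (relation, key, value) facts; conflicting values form an or-set."""
    merged = defaultdict(set)
    for relation, key, value in facts:
        merged[(relation, key)].add(value)
    return dict(merged)


facts = [
    ("Salary", "John Doe", 120_000),
    ("Salary", "John Doe", 135_000),
]
merged = merge_conflicts(facts)
# The two inconsistent facts are now one entry whose value is a set of
# alternatives, mirroring Salary (John Doe, $120,000 | $135,000).
alternatives = merged[("Salary", "John Doe")]
```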
Uncertainty Formalisms 101
• Represent a set of possible worlds
• A-tuples - uncertainty on attribute values:
– (PODS, 2006, {Chicago, Baltimore})
– Tuples can be optional
• Not closed under relational operators
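To see what "a set of possible worlds" means for an A-tuple, the following Python sketch enumerates them for the example above. It is a toy illustration under assumed conventions: an uncertain attribute is encoded as a Python set, and the helper name `a_tuple_worlds` and its `optional` flag are hypothetical.

```python
from itertools import product


def a_tuple_worlds(a_tuple, optional=False):
    """Enumerate the possible worlds of one A-tuple.

    Each attribute is either a plain value or a set of alternatives.
    If optional is True, the world where the tuple is absent is included.
    """
    choices = [
        sorted(v) if isinstance(v, (set, frozenset)) else [v]
        for v in a_tuple
    ]
    # Each world is a list of tuples; here it holds exactly one tuple.
    worlds = [[combo] for combo in product(*choices)]
    if optional:
        worlds.append([])
    return worlds


worlds = a_tuple_worlds(("PODS", 2006, {"Chicago", "Baltimore"}))
worlds_if_optional = a_tuple_worlds(
    ("PODS", 2006, {"Chicago", "Baltimore"}), optional=True
)
```

The first call yields two worlds, one per city; marking the tuple optional adds a third world in which it does not appear at all.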
Uncertainty Formalisms 101 (2)
• X-tuples – uncertainty on entire tuples:
– {(PODS, 2006, Chicago), (PODC, 2006, Baltimore)}
• Still not closed under relational operators
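The difference from A-tuples is that an X-tuple chooses among whole tuples, so attribute values stay correlated. The Python sketch below contrasts the two on the example above; the helper name `x_tuple_worlds` and the list-of-alternatives encoding are assumptions for illustration.

```python
from itertools import product


def x_tuple_worlds(x_tuples):
    """Enumerate possible worlds: pick exactly one alternative per X-tuple."""
    return [list(choice) for choice in product(*x_tuples)]


x_tuple = [("PODS", 2006, "Chicago"), ("PODC", 2006, "Baltimore")]
worlds = x_tuple_worlds([x_tuple])

# Choosing each attribute independently, as an A-tuple would, admits every
# combination, including pairings that the X-tuple rules out.
attribute_combos = set(
    product({"PODS", "PODC"}, {2006}, {"Chicago", "Baltimore"})
)
```

The X-tuple has two worlds, while independent attribute choices give four combinations, among them (PODS, 2006, Baltimore), which the X-tuple excludes.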
Uncertainty 101: C-Tables
• { (PODC, 2006, Chicago, X = 1),
(PODS, 2006, Chicago, X <> 1),
(SIGMOD, 2006, Chicago, X <> 1) }
• Closed and Complete!
• Understandable?
– See [Das Sarma et al., ICDE 2006]: working models for uncertain data
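A C-table attaches a condition to each tuple, and each assignment of the variables picks out one possible world. The Python sketch below evaluates the example C-table under two valuations of X. It is a toy encoding: conditions are written as Python lambdas, and the helper name `c_table_world` and the value 2 used for "X <> 1" are assumptions.

```python
def c_table_world(c_table, valuation):
    """Return the tuples whose condition holds under the given valuation."""
    return [row for row, condition in c_table if condition(valuation)]


# Each entry pairs a tuple with a condition over the variable X.
c_table = [
    (("PODC", 2006, "Chicago"), lambda v: v["X"] == 1),
    (("PODS", 2006, "Chicago"), lambda v: v["X"] != 1),
    (("SIGMOD", 2006, "Chicago"), lambda v: v["X"] != 1),
]

world_when_x_is_1 = c_table_world(c_table, {"X": 1})
world_otherwise = c_table_world(c_table, {"X": 2})
```

Because PODS and SIGMOD share the condition X <> 1, they appear together or not at all, a correlation across tuples that neither A-tuples nor a single X-tuple expresses.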
Adding Lineage to X-tuples [ULDBs, Benjelloun et al., VLDB 06]
{(PODS, 2006, Ch), (PODC, 2006, Ba)}
[Figure: lineage labels l1 and l2 with the condition ! (l1 & l2), linking the X-tuples { (…) (…) } (t) and { (…) (…) } (t’)]
No effect on complexity of relational operators
Adding Lineage to A-tuples
(PODS, {2005,2006}, {Chicago, Baltimore})
l1 l2 l3 l4
Lineage on views (projections)
Uncertainty should also be on views!
(Halevy, Los Altos, CA, professor},
0.8 0.6
Answering queries using uncertain views
See [Dalvi & Suciu, 2005] for a great start
Putting it all Together
Query
• Approximate mappings
• Evidence combination
Evolve
• Reuse human attention
• Create approximate semantic relationships
Reflect
• Foundation for reasoning about uncertainty, inconsistency and lineage
Conclusion and Outlook
• Data management moving to consumers
• Dataspaces: key element in this agenda
– Pay as you go data management
– Reuse human attention
• The role of theory:
– Reflect, generalize and explain
– People, people, people