principles of dataspace systems alon halevy pods june 26, 2006

50
Principles of Dataspace Systems Alon Halevy PODS June 26, 2006

Post on 19-Dec-2015

219 views

Category:

Documents


1 download

TRANSCRIPT

Principles of Dataspace Systems

Alon Halevy

PODS

June 26, 2006

Outline

• Example data management challenges Denote by: “dataspaces” [Franklin, H., Maier]

• Dataspace Support Platforms:– “Pay-as-you-go” data management

• Putting meat to the bones:– Specific research problems, recent progress– Querying, dataspace evolution, reflection

• (Possibly) some predictions and subliminal messages.

Shrapnels in Baghdad

Story courtesy of Phil Bernstein

OriginatedFrom

PublishedIn

ConfHomePage

ExperimentOf

ArticleAbout

BudgetOf

CourseGradeIn

AddressOf

Cites

CoAuthor

FrequentEmailer

HomePage

Sender

EarlyVersion

Recipient

AttachedTo

PresentationFor

Personal Information Management[Semex: Dong et al.]

Google Base

The Web is Getting Semantic

• Forms (millions)• Vertical search engines (hundreds)• Annotation schemes: Flickr, ESP Game• Google Coop

– DB search engine coming soon!

“A little semantics goes a long way”

[See Madhavan talk on Wednesday afternoon]

“Data is the plural of anecdote”

• Digital libraries, enterprises, “smart homes”

• Corie Environmental Observation System– Talk to Maier

• Circle of Blue– Data about the world’s water sources

• The Boeing 777– [Hanrahan @ Stanford]

Requirements

• A system that:– Is defined by boundaries (organizational,

physical, logical), not explicit entry.

– Provides best-effort services• Little or no setup time

– Leverages semantics when possible

Manage dataspaces

Dataspaces vs. Data Integration

BooksAndMusicTitleAuthorPublisherItemIDItemTypeSuggestedPriceCategoriesKeywords

Books TitleISBNPriceDiscountPriceEdition

CDs AlbumASINPriceDiscountPriceStudio

BookCategoriesISBNCategory

CDCategoriesASINCategory

ArtistsASINArtistNameGroupName

AuthorsISBNFirstNameLastName

Inventory Database A

Inventory Database B

Data integration requires semantic mappings

Dataspaces vs. Data Integration

Dataspaces are “pay as you go”

Benefit

Investment (time, cost)

Dataspaces

Data integration solutions

Artist: Mike Franklin

Dataspaces vs. Data Integration

Data integration systems require semantic mappings.– Dataspaces are “pay-as-you-go”

• Dataspaces capture a broader class of semantic relationships:

– DerivedFrom, SnapshotOf, HighlyCorrelatedWith, …

Why Do This Now?

• Fact:– Data management is about people, not enterprises.

• Prediction:– In the next few years, DB conferences will be about

data management for the masses.

• “CMA”:– If not, our community will become largely irrelevant.

• Observation:– We’re doing it anyway: e.g., DB&IR, information

extraction, uncertainty, …

Outline

Example data management challenges Denote by: “dataspaces”

Dataspace Support Platforms:– “Pay-as-you-go” data management

• Putting meat to the bones:– Specific research problems, recent progress– Querying, dataspace evolution, reflection

• (Possibly) some predictions and subliminal messages.

Logical Model:Participants and Relationships

XML

WSDL

SDB

sensor

sensor

sensorjava

RDF

RDB

WSDL

XML

java

schema mappingmanually created

RDB

snapshot

view replica

1hr updates

Relationships

• General form:

(Obj1, Rel, Obj2, p)

• Obj1, Obj2: instances or sources,

• p: degree of certainty

• Language for describing relationships? – [Rosati]

Dataspace Support Platforms (DSSP)

XML

WSDL

SDB

sensor

sensor

sensorjava

RDF

RDB

WSDL

XML

java

schema mappingmanually created

RDB

snapshot

view replica

1hr updates

Catalog

Search & query

Local Store &Index

Administration

Discover &

Enhance

OutlineExample data management challenges

Denote by: “dataspaces”

Dataspace Support Platforms:– “Pay-as-you-go” data management

Putting meat to the bones:– Specific research problems, recent progress– Querying, dataspace evolution, reflection

• (Possibly) some predictions and subliminal messages.

Technical Outline

Query

Evolve

Reflect

• Queries

• Answers

• Query processing

Dataspace Queries

• Keyword queries as starting point– Later may be refined to add structure– Formulated in terms of user’s “schema”

• Mostly of the form– Instance*:

• “britany spears”

– P (instance)• “chicago weather”• “PC chair PODS”

Query

Evolve

Reflect

Semantics of Answers

1. The actual answers:– P(instance), P*(instance)

Query

Evolve

Reflect

Weather Seattle

Semantics of Answers

1. The actual answers:– P(instance), P*(instance)

2. Sources where answer can be found:– Partially specify the query to the source– Help the user clean the query

Query

Evolve

Reflect

Toyota Corolla Palo alto

Volvo Palo altoVolvo Palo altoToyota Palo alto

Semantics of Answers

1. The actual answers:– P(instance), P*(instance)

2. Sources where answer can be found:– Partially specify the query to the source– Help the user clean the query

3. Supporting facts or sources:– Facts that can be used to derive P(instance)– Rest of derivation may be obvious to user

Query

Evolve

Reflect

Related or Partial Answers

• In which country was Dan Suciu born?– Bucharest

• Latest edition of software X:– 2004 edition

• Is the Space Needle higher than the Eiffel Tower?– Height of Seattle Space Needle– Height of Eiffel Tower

Query

Evolve

Reflect

184m

324m

Rank all types of answers

Query Processing: Data Integration

Q1Q2Q3Q4

Q41

Q42

Q43

Weather(Chicago)

Query

Evolve

Reflect

PDMSActive XMLData exchange

Query Processing: DSSPsQuery: Jan Van den Bussche address

First name: JanMiddle name: Last name: Van den BusscheAddress: ?

Query

Evolve

Reflect

Query Processing: DSSPsFirst name: JanMiddle name: Last name: Van den BusscheAddress: ?

addressaddress:

City required

Keyword query:J. vd Bussche

……

City?

StreetAdr

……

(t1,p1)

city, zip

(t2,p2)

……

Companiesaddress

Query

Evolve

Reflect

(t1,p3)

Two Principles

• Mappings are approximate at best– What do approximate mappings mean? – Answering queries with them?– Composition?

• Answering queries =Finding evidence + combining evidence

Query

Evolve

Reflect

Technical Outline

Query

Evolve

Reflect• Reuse human

attention

The Cost of Semantics

Benefit

Investment (time, cost)

Dataspaces

Data integration solutions

Artist: Mike Franklin

?

Semantic integration modeled by:

{(Obj1, rel, Obj2, p),…}

Query

Evolve

Reflect

Reusing Human Attention• Principle:

User action = statement of semantic relationshipLeverage actions to infer other semantic relationships

• Examples– Providing a semantic mapping

• Infer other mappings

– Writing a query • Infer content of sources, relationships between sources

– Creating a “digital workspace”• Infer “relatedness” of documents/sources• Infer co-reference between objects in the dataspace

– Annotating, cutting & pasting, browsing among docs

Query

Evolve

Reflect

Past, Present and Future

• Leverage past actions & existing structure: – [Dong et al., 2004, 2005], [He & Chang, 2003]

• Generalize from current actions– Queries, schema mappings

• Beg for extra attention:– ESP [von Ahn], mass collaboration [Doan+],

active learning [Sarawagi et al.]

Query

Evolve

Reflect

Reuse: Learning Schema Mappings[Doan et al.]

• Classifiers for mediated schema Thousands of web forms mapped in little time

Transformic Inc: deep web search.

[Madhavan et al.]: infer mappings for any schemas in the domain

Mediated schemaAction

Query

Evolve

Reflect

( S1, M, S, p)

Technical Outline

Query

Evolve

Reflect

• Unify lineage, uncertainty, and inconsistency

• Model them on views

Dataspace Reflection

• Answers are uncertain in dataspaces:– data, – mappings, – query answering techniques

• Data may be inconsistent

• Tracking lineage is crucial

Query

Evolve

Reflect

A DSSP Needs to

• Process uncertain data

• Update uncertainty with new evidence

• Be proactive about reducing uncertainty

• Live with inconsistency

• Leverage lineage to reduce uncertainty

( Obj1, Rel, Obj2, p)

Query

Evolve

Reflect

Two Principles

• Need a single formalism for modeling:– uncertainty,– inconsistency, and– lineage

• Model them on views

Query

Evolve

Reflect

Israel populationIsrael population

Query

Evolve

Reflect

Uncertainty &

Lineage

Trio @Stanford

Uncertainty and Inconsistency

• Inconsistency = uncertainty about the truth– Salary (John Doe, $120,000)– Salary (John Doe, $135,000)Salary (John Doe, $120,000 | $135,000)

• Orchestra, Ives @ U. Penn.

Query

Evolve

Reflect

Uncertainty Formalisms 101

• Represent a set of possible worlds

• A-tuples - uncertainty on attribute values:– (PODS, 2006, {Chicago, Baltimore})– Tuples can be optional

• Not closed under relational operators

Query

Evolve

Reflect

Uncertainty Formalisms 101 (2)

• X-tuples – uncertainty on entire tuples:– {(PODS, 2006, Chicago), (PODC, 2006, Baltimore)}

• Still not closed under relational operators

Query

Evolve

Reflect

Uncertainty 101: C-Tables

• { (PODC, 2006, Chicago, X=1),

(PODS, 2006, Chicago, X <>1),

(SIGMOD, 2006, Chicago, X<>1) }

• Closed and Complete!

• Understandable?– See [Das Sarma et al., ICDE 2006]: working

models for uncertain data Query

Evolve

Reflect

Adding Lineage to X-tuples[ULDBs, Benjelloun et al., VLDB 06]

{(PODS, 2006, Ch), (PODC, 2006, Ba)}

l1 l2! (l1 & l2){ (…) (…) }

(t)

{ (…) (…) }

(t’)Query

Evolve

ReflectNo effect on complexity of relational operators

Adding Lineage to A-tuples

(PODS, {2005,2006}, {Chicago, Baltimore})

l1 l2 l3 l4Lineage on views (projections) Uncertainty should also be on views!

(Halevy, Los Altos, CA, professor},

0.8 0.6

Answering queries using uncertain views See [Dalvi & Suciu, 2005] for a great start

Query

Evolve

Reflect

Putting it all Together

Query

Evolve

Reflect

• Approximate mappings• Evidence combination

• Reuse human attention• Create approximate semantic relationships

• Foundation for reasoning about uncertainty, inconsistency and lineage

Conclusion and Outlook

• Data management moving to consumers

• Dataspaces: key element in this agenda– Pay as you go data management– Reuse human attention

• The role of theory:– Reflect, generalize and explain– People, people, people

Some References

• SIGMOD Record, December 2005:– Original dataspace vision paper

• Maier EDBT 2006 tak

• PODS 2006 proceedings: challenges

• Data Integration: the Teenage Years– VLDB 2006

• Teaching integration to undergraduates: – SIGMOD Record, September, 2003.