Managing a Space of Heterogeneous Data

Xin (Luna) Dong, University of Washington

March, 2007

Once upon a time…

Nowadays…

[Figure: data sources D1, D2, D3, D4, D5]

Mappings Between Heterogeneous Data Sources

MovieDVD
Name          Length    Status    Price   Rate
The Departed  151 mins  In stock  $34.99  Excellent
…             …         …         …       …

Movie
ID     Title         Year  Genre  Runtime  Director
15827  The Departed  2006  Crime  151 min  32468

Director
DirectorID  Name
32468       Martin Scorsese

Review
MovieID  Review
15827    Martin Scorsese Hits the Streets Again!

Traditional Data Integration Systems Require Semantic Mappings Between Data Sources Up Front

[Figure: data sources D1 to D5 under a Mediated Schema; a query Q posed on the mediated schema, with queries Q1 to Q5 on the individual sources]

In Many Applications it is Hard to Obtain Precise Semantic Mappings

[Figure: data sources D1, D2, D3, D4, D5, marked with a question mark]

Scenario 1. Different Websites About Movies

Intranet, Internet

Scenario 2. Personal Information Space

In Many Applications it is Hard to Obtain Precise Semantic Mappings

[Figure: data sources D1 to D5, a Mediated Schema, and a query Q]

Managing Dataspaces

Dataspaces [Halevy et al., PODS’06]

Collections of heterogeneous data sources

Do not necessarily include semantic mappings

Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web

My goal: Provide quality search, querying and browsing as the system evolves

Heterogeneity at Different Levels

Name: First: Luna, Last: Dong; E-Mail Addresses: [email protected]

@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}

Heterogeneity at Instance Level

Name: First: Luna, Last: Dong; E-Mail Addresses: [email protected]

Form of heterogeneity: The same real-world object can be referred to using different attribute values

Current work: Record linkage. Most work assumes matching tuples from a single database table that has a fair number of attributes (surveyed in [Winkler, 2006])

Contributions: Reference reconciliation. Reconcile instances of multiple classes with only limited attributes [Sigmod’05]

Heterogeneity at Schema Level

Name: First: Luna, Last: Dong; E-Mail Addresses: [email protected]

Form of heterogeneity:
The same domain can be described using different schemas
Data can be (semi-)structured or unstructured

Current work:
Schema matching (surveyed in [Rahm&Bernstein, 2001])
Query reformulation (surveyed in [Halevy 2000])

Contributions:
Probabilistic schema mapping [VLDB’07]
Visualizing heterogeneous data [InfoVis’07]

Heterogeneity at Query Level

Name: First: Luna, Last: Dong; E-Mail Addresses: [email protected]

Form of heterogeneity: Different terms and different levels of structural details
Keyword search: ‘Semex Dong’
Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)

Current work: Keyword search on databases (Discover, DBExplorer, etc.)

Contributions: Seamless querying of structured and unstructured data
Indexing heterogeneous data [Sigmod’07]
Answering structured queries on unstructured data [WebDB’06]

Outline

Problem definition and goals
Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
Technical contributions:
  Reference reconciliation [Sigmod 2005]
  Indexing heterogeneous data [Sigmod 2007]
  Answering structured queries on unstructured data [WebDB 2006]
  Probabilistic schema mapping [VLDB 2007]
  Visualizing heterogeneous data [InfoVis 2007]
Future research directions

[Figure: association labels: OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom]

Semex Generates a Logical View of Meaningful Objects and Associations

Semex Provides Association Browsing of One’s Personal Information

Names

Emails

Alon. Y. Levy

Semex Provides Association Browsing of One’s Personal Information

A Platform for Personal Information Management and Integration

Title

Year

CIDR

Trio: A System for Integrated Management of Data, Accuracy, and Lineage

Question 1: Which emails has my advisor sent me about my thesis?

[email protected]@[email protected]@transformic.com

Question 2: Who have been working on schema matching?

6 Messages, 67 Articles

31 Persons working on Schema Matching (e.g., Alon Halevy, Phil Bernstein, Renee Miller, Anhai Doan)

Search ‘Schema Matching’

Question 3: Which of my friends published in Sigmod 2007?

My friends who published papers in

Sigmod 2007

Semex Architecture

[Figure: architecture diagram. Labels: Data Integration Module; Schema Management Module; Data Analysis Module; Domain Model; Domain Manager; Reference Reconciliater; Association DB (Objects, Associations); Extractors; Integrator; Indexer; Index; Searcher; Browser; Analyzer. Data sources: Word, PPT, PDF, Latex, Email, Webpage, Excel, DB]

Outline

Problem definition and our principle
Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
Technical contributions:
  Reference reconciliation [Sigmod 2005]
  Indexing heterogeneous data [Sigmod 2007]
  Answering structured queries on unstructured data [WebDB 2006]
  Probabilistic schema mapping [VLDB 2007]
  Visualizing heterogeneous data [InfoVis 2007]
Future research directions




Instance level
• Reference Reconciliation [Sigmod’05]

Query level
• Answering structured queries on unstructured data [WebDB’06]
• Indexing heterogeneous data [Sigmod’07]

Schema level
• Probabilistic schema mapping [VLDB’07]
• Visualization of heterogeneous data [InfoVis’07]

Reference Reconciliation is Crucial in Dataspaces

[Figure: references found under Names and Emails that all denote the same person: Xin (Luna) Dong; xin dong; xinluna dong; luna; dongxin; x. dong; Lab-#dong xin; dong xin luna]

Previous Approaches

A very active area of research in databases, data mining and AI

Most current approaches assume matching tuples from a single database table

Traditional approaches are based on pair-wise comparisons (surveyed in [Winkler, 2006])

New approaches explore relationships between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004]

Harder for a complex information space

Challenges for a Complex Information Space

Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)

Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)

Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)

Challenges for a Complex Information Space

Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)

Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)
p7=(“Eugene Wong”, “[email protected]”)
p8=(null, “[email protected]”)
p9=(“mike”, “[email protected]”)

Challenges:
1. Multiple Classes
2. Limited Information
3. Multi-value Attributes

Intuition: Exploit Association Network

We extract from dataspaces networks of instances and associations between the instances

Key: exploit the network, specifically, the clues hidden in the associations

Strategy I. Exploiting Richer Evidence

Cross-attribute similarity – Name&email
p5=(“Stonebraker, M.”, null)
p8=(null, “[email protected]”)

Context Information I – Contact list
p5=(“Stonebraker, M.”, null, {p4, p6})
p8=(null, “[email protected]”, {p7})
p6=p7

Context Information II – Authored articles
p2=(“Michael Stonebraker”, null)
p5=(“Stonebraker, M.”, null)
p2 and p5 authored the same article
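The three kinds of evidence in Strategy I can be illustrated with a small sketch. This is not the algorithm from the Sigmod’05 paper: the `PersonRef` structure, the token-containment test for name and email, and the e-mail address `stonebraker@example.org` are all assumptions made for illustration (the slide elides the real address).

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PersonRef:
    """One extracted person reference (hypothetical structure for this sketch)."""
    name: Optional[str] = None
    email: Optional[str] = None
    contacts: Set[str] = field(default_factory=set)  # ids of reconciled contacts
    articles: Set[str] = field(default_factory=set)  # ids of reconciled articles


def name_tokens(name: str) -> Set[str]:
    """Lowercased alphabetic tokens of a name, with punctuation ignored."""
    cleaned = "".join(c if c.isalpha() else " " for c in name.lower())
    return set(cleaned.split())


def name_email_evidence(name: Optional[str], email: Optional[str]) -> bool:
    """Cross-attribute evidence: the email account contains a long name token."""
    if not name or not email:
        return False
    account = email.split("@")[0].lower()
    return any(len(tok) >= 4 and tok in account for tok in name_tokens(name))


def evidence(a: PersonRef, b: PersonRef) -> Set[str]:
    """Collect the kinds of evidence suggesting that a and b are the same person."""
    found = set()
    if name_email_evidence(a.name, b.email) or name_email_evidence(b.name, a.email):
        found.add("name&email")
    if a.contacts & b.contacts:
        found.add("contact")
    if a.articles & b.articles:
        found.add("article")
    return found


# Assumed inputs: a1 and a2 are already reconciled into "a1", p6 and p7 into "p6".
p2 = PersonRef(name="Michael Stonebraker", articles={"a1"})
p5 = PersonRef(name="Stonebraker, M.", contacts={"p4", "p6"}, articles={"a1"})
p8 = PersonRef(email="stonebraker@example.org", contacts={"p6"})
```

With these inputs, `evidence(p5, p8)` reports both name&email and contact evidence, while `evidence(p2, p5)` reports only the shared article. How the evidence is weighted and thresholded is left open here.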

Considering Only Attribute-wise Similarities Cannot Merge Persons Well

[Bar chart: #(Person Partitions) by Evidence. Attribute-wise matching yields 3159 partitions; reference line at 1750; annotation: 1409]

Person references: 24076; real-world persons (gold standard): 1750

Considering Richer Evidence Improves the Result

[Bar chart: #(Person Partitions) by Evidence. Attr-wise: 3159; Name&Email: 2169; Article: 2169; Contact: 2096. Reference line at 1750. Annotations: 1409, 346]

Person references: 24076; real-world persons: 1750

Strategy II. Propagate Information Between Reconciliation Decisions

Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)

Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)

Propagating Information Between Reconciliation Decisions Further Improves the Result

[Bar chart: #(Person Partitions) by Evidence (Attr-wise, Name&Email, Article, Contact). Traditional: 3159, 2169, 2169, 2096. Propagation: 3159, 2146, 2135, 2022. Reference line at 1750. Annotations: 1409, 346, 272]

Strategy III. Reference Enrichment

p2=(“Michael Stonebraker”, null, {p1,p3})
p8=(null, “[email protected]”, {p7})
p9=(“mike”, “[email protected]”, null)

p8-9 =(“mike”, “[email protected]”, {p7})
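A minimal sketch of the enrichment step, assuming that p8 and p9 are merged because they carry the same e-mail address. The `EnrichedRef` structure and the address `stonebraker@example.org` are hypothetical; the slide elides the real address.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class EnrichedRef:
    """A reference that pools the attribute values of all merged references."""
    names: Set[str] = field(default_factory=set)
    emails: Set[str] = field(default_factory=set)
    contacts: Set[str] = field(default_factory=set)


def share_email(a: EnrichedRef, b: EnrichedRef) -> bool:
    """Assumed merge criterion for this sketch: an identical e-mail address."""
    return bool(a.emails & b.emails)


def enrich(a: EnrichedRef, b: EnrichedRef) -> EnrichedRef:
    """Merge two references; the result keeps every value of both."""
    return EnrichedRef(a.names | b.names, a.emails | b.emails, a.contacts | b.contacts)


HYPOTHETICAL_EMAIL = "stonebraker@example.org"  # placeholder, not from the slides
p8 = EnrichedRef(emails={HYPOTHETICAL_EMAIL}, contacts={"p7"})
p9 = EnrichedRef(names={"mike"}, emails={HYPOTHETICAL_EMAIL})

p8_9 = enrich(p8, p9) if share_email(p8, p9) else None
```

The merged reference p8_9 now has a name, an e-mail address, and a contact list, so later comparisons (for example against p2) can draw on more evidence than either p8 or p9 offered alone.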


References Enrichment Improves the Result More than Information Propagation

[Bar chart: #(Person Partitions) by Evidence; legend: Traditional, Enrichment, Propagation. Data labels: Traditional 3159, 2169, 2169, 2096; Enrichment 3169, 2036, 2036, 1910. Reference line at 1750. Annotations: 1409, 346, 160]

Applying Both Information Propagation and Reference Enrichment Gets the Best Result

[Bar chart: #(Person Partitions) by Evidence; legend: Traditional, Enrichment, Propagation, Full. Data labels: Traditional 3159, 2169, 2169, 2096; Full 3169, 2002, 1990, 1873. Reference line at 1750. Annotations: 1409, 346, 125]

Experiment Settings

Data sets: Four personal data sets
Use the same parameters and thresholds for all data sets
Measure:
Precision = #(correctly reconciled reference pairs) / #(reconciled reference pairs)
Recall = #(correctly reconciled reference pairs) / #(reference pairs that refer to the same real-world object)
F-measure = (2 · Precision · Recall) / (Precision + Recall)
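The pair-based measures above can be computed from two partitionings of the references. The sketch below assumes the result and the gold standard are each given as a list of sets of reference ids; the function names are made up for this example.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple


def same_partition_pairs(partitions: Iterable[Set[str]]) -> Set[Tuple[str, str]]:
    """All unordered reference pairs that share a partition."""
    pairs = set()
    for part in partitions:
        for a, b in combinations(sorted(part), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(result, gold):
    """Return (precision, recall, F-measure) over reconciled reference pairs."""
    predicted = same_partition_pairs(result)
    truth = same_partition_pairs(gold)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(truth) if truth else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data: p9 is wrongly merged with p2 and p5; p3 and p6 are wrongly kept apart.
result = [{"p1", "p4"}, {"p2", "p5", "p9"}, {"p3"}, {"p6"}]
gold = [{"p1", "p4"}, {"p2", "p5"}, {"p3", "p6"}, {"p9"}]
precision, recall, f_measure = pairwise_scores(result, gold)
```

In the toy data, 2 of the 4 reconciled pairs are correct and 2 of the 3 true pairs are found, giving precision 1/2, recall 2/3, and F-measure 4/7.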

Precision and Recall Increase Largely Compared with Attr-wise Matching

Dataset  Attr-wise Matching          Association Network
         Precision  Recall  F        Precision  Recall  F
A        0.995      0.509   0.673    0.982      0.947   0.964
B        0.81       0.803   0.806    0.958      0.891   0.923
C        0.987      0.782   0.873    0.814      0.925   0.867
D        0.694      0.837   0.759    0.942      0.737   0.827
Avg      0.872      0.733   0.778    0.924      0.875   0.895






Seamless Querying of Structured and Unstructured Data

[Figure: Structured Data & Semi-structured Data (RDB, XML, RDF) and Unstructured Data (webpages, Doc)]

Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword Search: “dataspaces”

I. Answering Structured Queries on Unstructured Data


[Figure: data sources RDB, XML, RDF, webpages, Doc, with Unstructured Data highlighted; query paths labeled DB, DB, IR, and “?”]

Our approach: query translation
Transform a structured query into keyword search
Keyword search on unstructured data

Challenges

Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Candidate keyword queries and their Top-10 Precision:
select title from paper where title LIKE +dataspaces and year +2005    0
title paper title +dataspaces year +2005                               0
+dataspaces +2005                                                      0.2
+dataspaces +2005 paper title                                          0.2
+dataspaces +2005 paper                                                0.6
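The best-scoring variant in the example keeps the predicate values as required terms and adds the table name as an optional term. The sketch below reproduces only that shape. It is not the translation algorithm of the WebDB’06 paper, and `to_keyword_query` and its parameters are hypothetical names.

```python
from typing import List, Tuple


def to_keyword_query(table: str,
                     predicates: List[Tuple[str, str]],
                     include_table: bool = True) -> str:
    """Turn a simple selection query into a keyword query.

    Predicate values become required terms (prefixed with '+'); attribute
    names and SQL keywords are dropped; the table name is an optional term.
    """
    terms = []
    for _attribute, value in predicates:
        cleaned = value.strip("%'\" ").lower()  # drop LIKE wildcards and quotes
        for word in cleaned.split():
            terms.append("+" + word)
    if include_table:
        terms.append(table.lower())
    return " ".join(terms)


keyword_query = to_keyword_query(
    "paper", [("title", "%Dataspaces%"), ("year", "2005")]
)
```

For the example query this yields `+dataspaces +2005 paper`, and with `include_table=False` it yields `+dataspaces +2005`. Which extra terms help is an empirical question; the precision figures above suggest that adding attribute names such as title can hurt.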

II. Answering Queries that Combine Keywords and Structural Information


[Figure: data sources RDB, XML, RDF, webpages, Doc, spanning structured, semi-structured, and Unstructured Data]

Keyword-based Structure-aware Queries: Article (title “dataspaces”) (year “2005”)


Neighborhood Keyword Queries: Return Implicitly Relevant Instances in Answers to Keyword Queries

6 Messages, 67 Articles

Search ‘Schema Matching’

31 Persons working on Schema Matching (e.g., Jeff Naughton, Anhai Doan, Phil Bernstein, Renee Miller)

Predicate Queries: Queries that Combine Keywords and Simple Structural Requirements

Message (Sender “Halevy”) (Recipient “Luna”) (Subject “thesis”)





Indexing Heterogeneous Data

Challenges
Index data from heterogeneous data sources
Capture both text values and structural information

Traditional Indexes
Build a separate index for each attribute to support structured queries
Build an inverted list to support keyword search
XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)

Index Heterogeneous Data Using an Inverted List

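[Figure: a Desktop source in which the persons Alon Halevy and Luna Dong are linked to the paper “Semex: …” by author and authoredPaper associations, and a Departmental Database table (StuID, lastName, firstName, …) containing the row (1000001, Xin, Dong, …)]

Inverted List keywords: Alon, Dong, Halevy, Luna, Semex, Xin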

Index Heterogeneous Data Using an Inverted List

Query: Dong

Inverted List
Alon 1
Dong 1 1
Halevy 1
Luna 1
Semex 1
Xin 1

Incorporate Attribute Labels in the Inverted List

Query: firstName “Dong”

Inverted List
Alon 1
Dong 1 1
Halevy 1
Luna 1
Semex 1
Xin 1

Incorporate Attribute Labels in the Inverted List

Query: firstName “Dong”, answered by looking up the keyword “Dong/firstName/”

Inverted List
Alon/name/ 1
Dong/name/ 1
Dong/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/lastName/ 1

Incorporate Attribute Hierarchy in the Inverted List

Query: name “Dong”

Inverted List
Alon/name/ 1
Dong/name/ 1
Dong/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/lastName/ 1

Incorporate Attribute Hierarchy in the Inverted List

Attribute hierarchy: name, with sub-attributes firstName and lastName

Query: name “Dong”, answered by looking up the keyword prefix “Dong/name/*”

Inverted List
Alon/name/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/name/lastName/ 1
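The keyword layout on these slides can be sketched as follows. This is an illustration, not the Sigmod’07 implementation: the class name, the `ATTRIBUTE_PATH` table, and the instance ids are assumptions, and a linear scan stands in for a prefix search over sorted keys.

```python
from collections import defaultdict
from typing import Dict, Set

# Assumed hierarchy from the slides: firstName and lastName are sub-attributes of name.
ATTRIBUTE_PATH = {
    "name": "name/",
    "firstName": "name/firstName/",
    "lastName": "name/lastName/",
    "title": "title/",
}


class ExtendedInvertedList:
    """Inverted list whose keywords carry the attribute path, e.g. 'Dong/name/firstName/'."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, instance_id: str, attribute: str, value: str) -> None:
        path = ATTRIBUTE_PATH[attribute]
        for word in value.split():
            self.postings[f"{word}/{path}"].add(instance_id)

    def lookup(self, attribute: str, word: str) -> Set[str]:
        """Answer the predicate query: attribute "word", including sub-attributes."""
        prefix = f"{word}/{ATTRIBUTE_PATH[attribute]}"
        hits: Set[str] = set()
        for key, ids in self.postings.items():  # stand-in for a prefix scan
            if key.startswith(prefix):
                hits |= ids
        return hits


index = ExtendedInvertedList()
index.add("desktop:luna", "name", "Luna Dong")
index.add("desktop:alon", "name", "Alon Halevy")
index.add("db:1000001", "lastName", "Xin")
index.add("db:1000001", "firstName", "Dong")
```

Here `index.lookup("name", "Dong")` returns both the desktop person and the database row, because the prefix `Dong/name/` also covers `Dong/name/firstName/`, while `index.lookup("firstName", "Dong")` returns only the database row.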

Incorporate Association Labels in the Inverted List

Query: author “Dong”

Inverted List
Alon/name/ 1
Dong/name/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/name/lastName/ 1

Incorporate Association Labels in the Inverted List

Query: author “Dong”, answered by looking up the keyword prefix “Dong/author/*”

Inverted List
Alon/author/ 1
Alon/name/ 1
Dong/author/ 1
Dong/name/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/authoredPaper/ 1 1
Semex/title/ 1
Xin/name/LastName/ 1

Answering Neighborhood Keyword Queries

Query: Semex, answered by looking up the keyword prefix “Semex/*”

Inverted List
Alon/author/ 1
Alon/name/ 1
Dong/author/ 1
Dong/name/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/authoredPaper/ 1 1
Semex/title/ 1
Xin/name/LastName/ 1
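A sketch of how association labels might enter the inverted list and how a neighborhood keyword query then becomes a prefix lookup. The data layout, function names, and instance ids are assumptions for illustration, and each instance is reduced to a single attribute to keep the example short.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_index(instances: Dict[str, Tuple[str, str]],
                associations: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """instances: id -> (attribute, text); associations: (source, label, target)."""
    postings: Dict[str, Set[str]] = defaultdict(set)
    for inst_id, (attribute, text) in instances.items():
        for word in text.split():
            postings[f"{word}/{attribute}/"].add(inst_id)
    for source_id, label, target_id in associations:
        # Words of the associated (target) instance are indexed for the source
        # instance under the association label.
        _attribute, text = instances[target_id]
        for word in text.split():
            postings[f"{word}/{label}/"].add(source_id)
    return postings


def prefix_lookup(postings: Dict[str, Set[str]], prefix: str) -> Set[str]:
    """Union of the postings of every keyword that starts with the prefix."""
    hits: Set[str] = set()
    for key, ids in postings.items():  # stand-in for a prefix scan over sorted keys
        if key.startswith(prefix):
            hits |= ids
    return hits


instances = {
    "alon": ("name", "Alon Halevy"),
    "luna": ("name", "Luna Dong"),
    "semex": ("title", "Semex"),
}
associations = [
    ("semex", "author", "alon"),
    ("semex", "author", "luna"),
    ("alon", "authoredPaper", "semex"),
    ("luna", "authoredPaper", "semex"),
]
postings = build_index(instances, associations)
```

The predicate query author “Dong” is `prefix_lookup(postings, "Dong/author/")` and returns the paper. The neighborhood keyword query Semex is `prefix_lookup(postings, "Semex/")` and returns the paper together with both of its authors, even though neither author mentions Semex in an attribute of their own.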

Experimental Setting

Data sets
  A 50MB personal data set
  Two 10GB XML data sets: Wikipedia, XMark Benchmark

Queries: with one predicate or keyword
  Predicate Query with leaf attributes
  Predicate Query with branch attributes
  Predicate Query with associations
  Neighborhood Keyword Query

Measure: in milliseconds
  Index-lookup time
  Query-answering time

Our Indexing Method Significantly Improves Query Answering

Query Type                         Plain Inverted List (10.6MB)          Extended Inverted List (28.1MB)
                                   Index Lookup (ms)  Query Answer (ms)  Index Lookup (ms)  Query Answer (ms)
Pred Query with leaf attributes    2                  22                 4                  6
Pred Query with branch attributes  3                  43                 4                  6
Pred Query with associations       3                  88                 6                  17
Neighborhood Keyword Query         18                 4174               48                 97

Our Indexing Method Scales Well

                                   Wikipedia        XMark w/o asso   XMark with asso
Index                              4.15hr (1.13GB)  6.64hr (3.04GB)  12.72hr (4.08GB)
Pred Query with leaf attributes    156              94               116
Pred Query with branch attributes  -                67               93
Pred Query with associations       -                -                217
Neighborhood Keyword Query         1646             1838             13468






Probabilistic Schema Mapping

S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)

Possible Mapping                              Probability
{(pname,name),(home-addr, mailing-addr)}      0.5
{(pname,name),(office-addr, mailing-addr)}    0.4
{(pname,name),(email-addr, mailing-addr)}     0.1

By-Table vs. By-Tuple Semantics

Ds=
pname  email-addr  home-addr      office-addr
Alice  alice@      Mountain View  Sunnyvale
Bob    bob@        Sunnyvale      San Jose

By-table semantics: one mapping applies to all tuples of the table. Possible DT:

Probability 0.5
name   mailing-addr
Alice  Mountain View
Bob    Sunnyvale

Probability 0.4
name   mailing-addr
Alice  Sunnyvale
Bob    San Jose

Probability 0.1
name   mailing-addr
Alice  alice@
Bob    bob@

By-tuple semantics: each tuple may follow a different mapping. Some of the possible DT:

Probability 0.2
name   mailing-addr
Alice  Mountain View
Bob    San Jose

Probability 0.16
name   mailing-addr
Alice  Sunnyvale
Bob    San Jose

Probability 0.04
name   mailing-addr
Alice  Sunnyvale
Bob    bob@

…
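The difference between the two semantics can be checked by brute force on the example. The sketch below answers the query "SELECT mailing-addr FROM T" by enumerating mappings (by-table) or sequences of per-tuple mappings (by-tuple). It is exponential in the number of tuples and only meant to illustrate the definitions, not the PTIME algorithms from the VLDB’07 paper; all names in it are made up for this example.

```python
from collections import defaultdict
from itertools import product
from typing import Dict

SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "San Jose"},
]

# (source attribute mapped to mailing-addr, probability of that mapping)
MAPPINGS = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]


def by_table() -> Dict[str, float]:
    """One mapping is applied to the whole table."""
    answers: Dict[str, float] = defaultdict(float)
    for attribute, prob in MAPPINGS:
        for value in {row[attribute] for row in SOURCE}:
            answers[value] += prob
    return dict(answers)


def by_tuple() -> Dict[str, float]:
    """Each tuple independently follows one of the mappings."""
    answers: Dict[str, float] = defaultdict(float)
    for choice in product(MAPPINGS, repeat=len(SOURCE)):
        prob = 1.0
        values = set()
        for row, (attribute, p) in zip(SOURCE, choice):
            prob *= p
            values.add(row[attribute])
        for value in values:
            answers[value] += prob
    return dict(answers)


table_answers = by_table()
tuple_answers = by_tuple()
```

On this data, Sunnyvale has probability 0.9 under by-table semantics (home-addr for Bob or office-addr for Alice, 0.5 + 0.4) but 0.7 under by-tuple semantics (1 − 0.6 · 0.5), while Mountain View has probability 0.5 under both.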

Theoretical Results

Query answering in by-table semantics
  In PTIME in the size of the data
Query answering in by-tuple semantics
  In general #P-complete in the size of the data
  In PTIME for two types of queries:
    The query contains a single table that is a target in a probabilistic mapping
    If a join attribute is in a table that is a target in a probabilistic mapping, the query returns the attribute

More Theoretical Results

Query answering in both semantics is in PTIME in the size of the probabilistic mapping
Compressed representations of probabilistic mappings
  We propose two compact representations of probabilistic mappings, such that query answering is still in PTIME in the size of the mapping
  When we encode probabilistic mappings using a Bayes Net, query answering can be exponential in the size of the mapping

Conclusions

Goal: Provide quality search, querying and browsing for dataspaces

Thesis Contributions
  An algorithm for reference reconciliation
  An indexing method for supporting queries that combine keywords and structure
  An algorithm for answering structured queries on unstructured data
  The concept and theoretical foundation for Probabilistic Schema Mapping
  An approach for visualizing heterogeneous data
  A PIM system incorporating the above

Future Work I. Evolve Semantic Relationships Between Data Sources on an As-needed Basis

[Figure: data sources D1 to D5 with a Mediated Schema and a query Q, shown alongside the same sources D1 to D5 without a mediated schema]

Future Work II. Manage Dataspaces at the Web-Scale

Challenges: Large scale and complex domains

Future directions:
1. Probabilistic data integration
2. Information redundancy
3. Universal search

Keyword Search

Research Methodology

Areas: Machine Learning, Information Retrieval, Database

System
1. Semex Personal Information Management System [Sigmod’05 Best Demo]
2. Woogle Web Service Search Engine [VLDB’04]

Theory
1. Probabilistic Schema Mapping [VLDB’07]
2. XML Query Containment [VLDB’04]
3. Optimization of Query Difference (Submitted)

Acknowledgement

[Figure: association graph around Project: Semex. Nodes: Person: Luna, Person: Alon, Person: Jayant, Person: Michelle, Person: Yuhan, Stanford Visual Grp, CIDR. Edge labels: advisor, co-worker, participant, projectLeader, collaborator, ArticleAbout, publishedIn]

Our Algorithm Equals or Outperforms Attr-wise Matching in All Classes

Class    Attr-wise Matching    Association Network
         Precision  Recall     Precision  Recall
Person   0.872      0.733      0.924      0.875
Article  0.997      0.977      0.999      0.976
Venue    0.935      0.790      0.987      0.937

Results on Cora Dataset Are Competitive with Other Reported Results

Results reported in other record linkage papers:
Precision/Recall = 0.990/0.925 [Cohen et al., 2002]
Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]
F-measure = 0.867 [Bilenko and Mooney, 2003]

Class    Attr-wise Matching     Dependency Graph
         Prec/Recall  F-msre    Prec/Recall  F-msre
Article  0.985/0.913  0.948     0.985/0.924  0.954
Person   0.994/0.985  0.989     1/0.987      0.993
Venue    0.982/0.362  0.529     0.837/0.714  0.771

Experiment Settings

Measure: Diversity and Dispersion

Diversity: For every result partition, how many real-world objects are included; ideally should be 1 (related to precision)

Dispersion: For every real-world object, how many result partitions include them; ideally should be 1 (related to recall)
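A small sketch of the two measures. The slide defines them per partition and per object; averaging them over all partitions and all objects to get one number per data set is an assumption made here, as are the function name and the toy data.

```python
from typing import Dict, List, Set, Tuple


def diversity_and_dispersion(result: List[Set[str]],
                             gold_object: Dict[str, str]) -> Tuple[float, float]:
    """result: partitions of reference ids; gold_object: reference id -> real-world object."""
    # Diversity: number of distinct real-world objects in each result partition.
    objects_per_partition = [len({gold_object[r] for r in part}) for part in result]
    diversity = sum(objects_per_partition) / len(result)

    # Dispersion: number of result partitions that contain each real-world object.
    partitions_of: Dict[str, Set[int]] = {}
    for idx, part in enumerate(result):
        for ref in part:
            partitions_of.setdefault(gold_object[ref], set()).add(idx)
    dispersion = sum(len(p) for p in partitions_of.values()) / len(partitions_of)
    return diversity, dispersion


# Toy data: r2 (object B) is wrongly merged with r1 (object A), so one partition
# holds two objects and object B is spread over two partitions.
result = [{"r1", "r2"}, {"r3"}, {"r4"}]
gold_object = {"r1": "A", "r2": "B", "r3": "B", "r4": "C"}
diversity, dispersion = diversity_and_dispersion(result, gold_object)
```

In the toy data both averages are 4/3. A perfect reconciliation gives exactly 1 for both.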

Diversity and Dispersion Are Very Close to 1

Dataset (#per/#ref)  Attr-wise Matching    Dependency Graph
                     Diversity/Dispersion  Diversity/Dispersion
A (1750/24076)       1.18/1.003            1.047/1.003
B (1989/36359)       1.067/1.01            1.039/1.008
C (1570/15160)       1.053/1.003           1.03/1.017
D (1518/17199)       1.041/1.004           1.023/1.005
Avg                  1.085/1.005           1.035/1.008


I. Visualizing Heterogeneous Data

Current data visualization
  Consider only data residing in a single database
  Allow users to specify a visualization for each type of data (e.g., Haystack [Karger et al., 2005])

Visualization of dataspaces needs to consider data from heterogeneous sources

Example Visualization: A Map Marked with Papers

Example Visualization: A Calendar with Presentation Slides
