Managing a Space of Heterogeneous Data Xin (Luna) Dong University of Washington March, 2007


TRANSCRIPT

Page 1: Managing a Space of Heterogeneous Data

Managing a Space of Heterogeneous Data

Xin (Luna) Dong, University of Washington

March, 2007

Page 2: Managing a Space of Heterogeneous Data

Once upon a time…

Page 3: Managing a Space of Heterogeneous Data

Nowadays…

D1

D2

D3

D4

D5

Page 4: Managing a Space of Heterogeneous Data

Mappings Between Heterogeneous Data Sources

MovieDVD
Name | Length | Status | Price | Rate
The Departed | 151 mins | In stock | $34.99 | Excellent

Movie
ID | Title | Year | Genre | Runtime | Director
15827 | The Departed | 2006 | Crime | 151 min | 32468

Director
DirectorID | Name
32468 | Martin Scorsese

Review
MovieID | Review
15827 | Martin Scorsese Hits the Streets Again!

Page 5: Managing a Space of Heterogeneous Data

Traditional Data Integration Systems Require Semantic Mappings Between Data Sources Up Front

D1

D2

D3

D4

D5

Mediated Schema

[Figure: a query Q over the mediated schema, with source queries Q1 to Q5 over D1 to D5]

Page 6: Managing a Space of Heterogeneous Data

In Many Applications it is Hard to Obtain Precise Semantic Mappings

D1

D2

D3

D4

D5?

Page 7: Managing a Space of Heterogeneous Data

Scenario 1. Different Websites About Movies

Page 8: Managing a Space of Heterogeneous Data

Intranet, Internet

Scenario 2. Personal Information Space

Page 9: Managing a Space of Heterogeneous Data

In Many Applications it is Hard to Obtain Precise Semantic Mappings

D1

D2

D3

D4

D5

Mediated Schema

Q

Page 10: Managing a Space of Heterogeneous Data

Managing Dataspaces

Dataspaces [Halevy et al., PODS’06]
• Collections of heterogeneous data sources
• Do not necessarily include semantic mappings
• Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web

My goal: Provide quality search, querying and browsing as the system evolves

Page 11: Managing a Space of Heterogeneous Data

Heterogeneity at Different Levels

Name: First: Luna, Last: Dong. E-Mail Addresses: [email protected]

@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}

Page 12: Managing a Space of Heterogeneous Data

Heterogeneity at Instance Level

Name: First: Luna, Last: Dong. E-Mail Addresses: [email protected]

@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}

Form of heterogeneity
• The same real-world object can be referred to using different attribute values

Current work
• Record linkage: most work assumes matching tuples from a single database table that has a fair number of attributes (surveyed in [Winkler, 2006])

Contributions
• Reference reconciliation: reconcile instances of multiple classes and with only limited attributes [Sigmod’05]

Page 13: Managing a Space of Heterogeneous Data

Heterogeneity at Schema Level

Name: First: Luna, Last: Dong. E-Mail Addresses: [email protected]

@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}

Form of heterogeneity
• The same domain can be described using different schemas
• Data can be (semi-)structured or unstructured

Current work
• Schema matching (surveyed in [Rahm&Bernstein, 2001])
• Query reformulation (surveyed in [Halevy 2000])

Contributions
• Probabilistic schema mapping [VLDB’07]
• Visualizing heterogeneous data [InfoVis’07]

Page 14: Managing a Space of Heterogeneous Data

Heterogeneity at Query Level

Name: First: Luna, Last: Dong. E-Mail Addresses: [email protected]

@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}

Form of heterogeneity
• Different terms and different levels of structural details
• Keyword search: ‘Semex Dong’
• Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)

Page 15: Managing a Space of Heterogeneous Data

Heterogeneity at Query Level

Name: First: Luna, Last: Dong. E-Mail Addresses: [email protected]

@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}

Form of heterogeneity
• Different terms and different levels of structural details
• Keyword search: ‘Semex Dong’
• Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)

Current work
• Keyword search on databases (Discover, DBXplorer, etc.)

Contributions
• Seamless querying of structured and unstructured data
  • Indexing heterogeneous data [Sigmod’07]
  • Answering structured queries on unstructured data [WebDB’06]

Page 16: Managing a Space of Heterogeneous Data

Outline
• Problem definition and goals
• Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
• Technical contributions:
  • Reference reconciliation [Sigmod 2005]
  • Indexing heterogeneous data [Sigmod 2007]
  • Answering structured queries on unstructured data [WebDB 2006]
  • Probabilistic schema mapping [VLDB 2007]
  • Visualizing heterogeneous data [InfoVis 2007]
• Future research directions

Page 17: Managing a Space of Heterogeneous Data

[Figure: network of objects and associations. Association labels: OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom]

Semex Generates a Logical View of Meaningful Objects and Associations

Page 18: Managing a Space of Heterogeneous Data

Semex Provides Association Browsing of One’s Personal Information

Names

Emails

Alon. Y. Levy

Page 19: Managing a Space of Heterogeneous Data

Semex Provides Association Browsing of One’s Personal Information A Platform for Personal Information

Management and Integration

Title

Year

Page 20: Managing a Space of Heterogeneous Data

Semex Provides Association Browsing of One’s Personal Information

CIDR

Page 21: Managing a Space of Heterogeneous Data

Semex Provides Association Browsing of One’s Personal Information

Trio: A System for Integrated Management of Data, Accuracy, and Lineage

Page 22: Managing a Space of Heterogeneous Data

Question 1: Which emails has my advisor sent me about my thesis?

[email protected]@[email protected]@transformic.com

Page 23: Managing a Space of Heterogeneous Data

Question 2: Who have been working on schema matching?

6 Messages, 67 Articles

31 Persons working on Schema Matching (e.g., Alon Halevy, Phil Bernstein, Renee Miller, Anhai Doan)

Search ‘Schema Matching’

Page 24: Managing a Space of Heterogeneous Data

Question 3: Which of my friends published in Sigmod 2007?

My friends who published papers in

Sigmod 2007

Page 25: Managing a Space of Heterogeneous Data

Semex Architecture

[Figure: architecture diagram. Modules: Data Integration Module, Schema Management Module, Data Analysis Module. Components: Extractors, Integrator, Reference Reconciliater, Indexer, Index, Association DB (Objects, Associations), Domain Model, Domain Manager, Searcher, Browser, Analyzer. Data sources: Word, PPT, PDF, Latex, Email, Webpage, Excel, DB]

Page 26: Managing a Space of Heterogeneous Data

Outline
• Problem definition and our principle
• Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
• Technical contributions:
  • Reference reconciliation [Sigmod 2005]
  • Indexing heterogeneous data [Sigmod 2007]
  • Answering structured queries on unstructured data [WebDB 2006]
  • Probabilistic schema mapping [VLDB 2007]
  • Visualizing heterogeneous data [InfoVis 2007]
• Future research directions

Page 27: Managing a Space of Heterogeneous Data

Heterogeneity at Different Levels

Name: First: Luna, Last: Dong. E-Mail Addresses: [email protected]

@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}

Instance level
• Reference reconciliation [Sigmod’05]

Query level
• Answering structured queries on unstructured data [WebDB’06]
• Indexing heterogeneous data [Sigmod’07]

Schema level
• Probabilistic schema mapping [VLDB’07]
• Visualization of heterogeneous data [InfoVis’07]

Page 28: Managing a Space of Heterogeneous Data

Reference Reconciliation is Crucial in Dataspaces

Xin (Luna) Dong

xin dong

xinluna dong

luna

dongxin

x. dong

Lab-#dong xin

dong xin luna

Names

Emails

Page 29: Managing a Space of Heterogeneous Data

Previous Approaches

A very active area of research in databases, data mining and AI

• Most current approaches assume matching tuples from a single database table
• Traditional approaches are based on pair-wise comparisons (surveyed in [Winkler, 2006])
• New approaches explore relationships between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004]

Harder for a complex information space

Page 30: Managing a Space of Heterogeneous Data

Challenges for a Complex Information Space

Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)

Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)

Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)

Page 31: Managing a Space of Heterogeneous Data

Challenges for a Complex Information Space

Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)

Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)

Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)
p7=(“Eugene Wong”, “[email protected]”)
p8=(null, “[email protected]”)
p9=(“mike”, “[email protected]”)

Challenges:
1. Multiple Classes
2. Limited Information
3. Multi-value Attributes

Page 32: Managing a Space of Heterogeneous Data

Intuition: Exploit Association Network

We extract from dataspaces networks of instances and associations between the instances

Key: exploit the network, specifically, the clues hidden in the associations

Page 33: Managing a Space of Heterogeneous Data

Strategy I. Exploiting Richer Evidence

• Cross-attribute similarity – Name&email
  p5=(“Stonebraker, M.”, null)
  p8=(null, “[email protected]”)

• Context Information I – Contact list
  p5=(“Stonebraker, M.”, null, {p4, p6})
  p8=(null, “[email protected]”, {p7})
  p6=p7

• Context Information II – Authored articles
  p2=(“Michael Stonebraker”, null)
  p5=(“Stonebraker, M.”, null)
  p2 and p5 authored the same article
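A minimal sketch of the cross-attribute (name and e-mail) evidence, assuming a very simple rule: a name token appearing in the local part of an address counts as evidence. The function names and the example address are hypothetical (the address on the slide is redacted in this transcript), and the slides do not describe the actual similarity function.

```python
import re

def name_tokens(name):
    """Lower-cased alphabetic tokens, e.g. 'Stonebraker, M.' -> ['stonebraker', 'm']."""
    return [t for t in re.split(r"[^a-z]+", name.lower()) if t]

def name_email_evidence(name, email):
    """True if a name token longer than two letters occurs in the e-mail's local part."""
    local = email.split("@", 1)[0].lower()
    return any(len(t) > 2 and t in local for t in name_tokens(name))

# Hypothetical address, used only for illustration.
p5_name = "Stonebraker, M."
p8_email = "stonebraker@example.edu"

print(name_email_evidence(p5_name, p8_email))        # name and address share a token
print(name_email_evidence("Eugene Wong", p8_email))  # no shared token
```

A reference with only a name and a reference with only an address share no attribute to compare, which is why evidence across attributes is needed at all.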

Page 34: Managing a Space of Heterogeneous Data

Considering Only Attribute-wise Similarities Cannot Merge Persons Well

[Figure: bar chart of #(Person Partitions) by Evidence, y-axis from 1750 to 3350. Value label: 3159. Gap label: 1409. Gold-standard line: 1750]

Person references: 24076. Real-world persons (gold standard): 1750.

Page 35: Managing a Space of Heterogeneous Data

Considering Richer Evidence Improves the Result

[Figure: bar chart of #(Person Partitions) by Evidence, y-axis from 1750 to 3350. Value labels: Attr-wise 3159, Name&Email 2169, Article 2169, Contact 2096. Gap labels: 1409, 346. Gold-standard line: 1750]

Person references: 24076. Real-world persons: 1750.

Page 36: Managing a Space of Heterogeneous Data

Strategy II. Propagate Information Between Reconciliation Decisions

Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)

Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)

Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)
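One way to picture propagation on this example is a worklist: once a1 and a2 are reconciled, their author references become candidates and can be merged on weaker name evidence. The sketch below is illustrative only. The title and page test, the (last name, first initial) key, and pairing authors by position are all assumptions, and the slides do not specify the actual algorithm.

```python
from collections import deque

# Article and person references from the slide; author sets are written as
# ordered lists here (an assumption made for this sketch).
articles = {
    "a1": ("Distributed Query Processing", "169-180", ["p1", "p2", "p3"]),
    "a2": ("Distributed query processing", "169-180", ["p4", "p5", "p6"]),
}
persons = {
    "p1": "Robert S. Epstein", "p2": "Michael Stonebraker", "p3": "Eugene Wong",
    "p4": "Epstein, R.S.", "p5": "Stonebraker, M.", "p6": "Wong, E.",
}

def name_key(name):
    """Weak evidence: (last name, first initial) for 'First Last' or 'Last, F.' forms."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
    else:
        parts = name.split()
        first, last = parts[0], parts[-1]
    return last.lower(), first[0].lower()

def same_article(x, y):
    """Toy test: same title ignoring case, and same page range."""
    return x[0].lower() == y[0].lower() and x[1] == y[1]

def reconcile():
    merged = []
    queue = deque([("article", "a1", "a2")])
    while queue:
        kind, x, y = queue.popleft()
        if kind == "article" and same_article(articles[x], articles[y]):
            merged.append((x, y))
            # Propagation: merging two articles makes their author
            # references candidates for reconciliation as well.
            for px, py in zip(articles[x][2], articles[y][2]):
                queue.append(("person", px, py))
        elif kind == "person" and name_key(persons[x]) == name_key(persons[y]):
            merged.append((x, y))
    return merged

merged = reconcile()
print(merged)
```

The person pairs are only compared because the article decision was made first, which is the dependency between decisions that the slide title refers to.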

Page 37: Managing a Space of Heterogeneous Data

Propagating Information Between Reconciliation Decisions Further Improves the Result

[Figure: bar chart of #(Person Partitions) by Evidence (Attr-wise, Name&Email, Article, Contact), y-axis from 1750 to 3350. Series: Traditional, Propagation. Value labels, Traditional: 3159, 2169, 2169, 2096. Value labels, Propagation: 3159, 2146, 2135, 2022. Gap labels: 1409, 346, 272. Gold-standard line: 1750]

Person references: 24076. Real-world persons: 1750.

Page 38: Managing a Space of Heterogeneous Data

Strategy III. Reference Enrichment

p2=(“Michael Stonebraker”, null, {p1,p3})
p8=(null, “[email protected]”, {p7})
p9=(“mike”, “[email protected]”, null)

p8-9=(“mike”, “[email protected]”, {p7})

V

XXV
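Reference enrichment, as the p8-9 example suggests, merges two reconciled references into one that carries the union of their information. A minimal sketch follows, assuming each attribute is stored as a set; the `PersonRef` type, the `enrich` function and the example address are hypothetical (the address on the slide is redacted in this transcript).

```python
from dataclasses import dataclass, field

@dataclass
class PersonRef:
    names: set = field(default_factory=set)
    emails: set = field(default_factory=set)
    contacts: set = field(default_factory=set)

def enrich(a, b):
    """Merge two reconciled references by taking the union of each attribute."""
    return PersonRef(a.names | b.names, a.emails | b.emails, a.contacts | b.contacts)

# p8 has an address and a contact, p9 has a name and the same address.
p8 = PersonRef(emails={"mike@example.org"}, contacts={"p7"})
p9 = PersonRef(names={"mike"}, emails={"mike@example.org"})

p8_9 = enrich(p8, p9)
print(p8_9)
```

The enriched reference has a name, an address and a contact list, so it can be compared with p2 on more evidence than either p8 or p9 alone.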

Page 39: Managing a Space of Heterogeneous Data

Reference Enrichment Improves the Result More than Information Propagation

[Figure: bar chart of #(Person Partitions) by Evidence (Attr-wise, Name&Email, Article, Contact), y-axis from 1750 to 3350. Series: Traditional, Enrichment, Propagation. Value labels, Traditional: 3159, 2169, 2169, 2096. Value labels, Enrichment: 3169, 2036, 2036, 1910. Gap labels: 1409, 346, 160. Gold-standard line: 1750]

Person references: 24076. Real-world persons: 1750.

Page 40: Managing a Space of Heterogeneous Data

Applying Both Information Propagation and Reference Enrichment Gets the Best Result

[Figure: bar chart of #(Person Partitions) by Evidence (Attr-wise, Name&Email, Article, Contact), y-axis from 1750 to 3350. Series: Traditional, Enrichment, Propagation, Full. Value labels, Traditional: 3159, 2169, 2169, 2096. Value labels, Full: 3169, 2002, 1990, 1873. Gap labels: 1409, 346, 125. Gold-standard line: 1750]

Person references: 24076. Real-world persons: 1750.

Page 41: Managing a Space of Heterogeneous Data

Experiment Settings
• Data sets: four personal data sets
• Use the same parameters and thresholds for all data sets
• Measures
  Precision = #(correctly reconciled reference pairs) / #(reconciled reference pairs)
  Recall = #(correctly reconciled reference pairs) / #(reference pairs that refer to the same real-world object)
  F-measure = 2·Precision·Recall / (Precision + Recall)
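The three measures are defined over pairs of references, so they can be computed from two partitionings: the result and the gold standard. A small sketch with made-up partitions (the identifiers and data below are illustrative, not from the experiments):

```python
from itertools import combinations

def reconciled_pairs(partitions):
    """Set of unordered reference pairs that share a partition."""
    pairs = set()
    for part in partitions:
        pairs.update(combinations(sorted(part), 2))
    return pairs

def pairwise_scores(result, gold):
    found = reconciled_pairs(result)   # pairs the algorithm reconciled
    truth = reconciled_pairs(gold)     # pairs that refer to the same object
    correct = found & truth
    precision = len(correct) / len(found) if found else 1.0
    recall = len(correct) / len(truth) if truth else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Illustrative partitions: the result wrongly puts p3 with p2 and p5.
result = [{"p1", "p4"}, {"p2", "p5", "p3"}, {"p6"}]
gold = [{"p1", "p4"}, {"p2", "p5"}, {"p3", "p6"}]

precision, recall, f_measure = pairwise_scores(result, gold)
print(precision, recall, f_measure)
```

Here 2 of the 4 reconciled pairs are correct and 2 of the 3 true pairs are found, so precision is 1/2, recall is 2/3 and the F-measure is 4/7.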

Page 42: Managing a Space of Heterogeneous Data

Precision and Recall Increase Largely Compared with Attr-wise Matching

Dataset | Attr-wise Matching: Precision | Recall | F | Association Network: Precision | Recall | F
A | 0.995 | 0.509 | 0.673 | 0.982 | 0.947 | 0.964
B | 0.81 | 0.803 | 0.806 | 0.958 | 0.891 | 0.923
C | 0.987 | 0.782 | 0.873 | 0.814 | 0.925 | 0.867
D | 0.694 | 0.837 | 0.759 | 0.942 | 0.737 | 0.827
Avg | 0.872 | 0.733 | 0.778 | 0.924 | 0.875 | 0.895

Page 43: Managing a Space of Heterogeneous Data

Heterogeneity at Different Levels

Name: First: Luna, Last: Dong. E-Mail Addresses: [email protected]

@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}

Instance level
• Reference reconciliation [Sigmod’05]

Query level
• Answering structured queries on unstructured data [WebDB’06]
• Indexing heterogeneous data [Sigmod’07]

Schema level
• Probabilistic schema mapping [VLDB’07]
• Visualization of heterogeneous data [InfoVis’07]

Page 44: Managing a Space of Heterogeneous Data

Seamless Querying of Structured and Unstructured Data

[Figure: structured and semi-structured data (RDB, XML, RDF) and unstructured data (Doc, webpages)]

Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword Search: “dataspaces”

Page 45: Managing a Space of Heterogeneous Data

I. Answering Structured Queries on Unstructured Data

[Figure: structured and semi-structured data (RDB, XML, RDF) and unstructured data (Doc, webpages); arrows labelled DB, DB, IR and ?]

Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword Search: “dataspaces”

Our approach: query translation
• Transform a structured query into keyword search
• Keyword search on unstructured data

Page 46: Managing a Space of Heterogeneous Data

Challenges

Example:
SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword query: select title from paper where title LIKE +dataspaces and year +2005

Top-10 precision: 0

Page 47: Managing a Space of Heterogeneous Data

Challenges

Example:
SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword query: title paper title +dataspaces year +2005

Top-10 precision: 0

Page 48: Managing a Space of Heterogeneous Data

Challenges

Example:
SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword query: +dataspaces +2005

Top-10 precision: 0.2

Page 49: Managing a Space of Heterogeneous Data

Challenges

Example:
SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword query: +dataspaces +2005 paper title

Top-10 precision: 0.2

Page 50: Managing a Space of Heterogeneous Data

Challenges

Example:
SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword query: +dataspaces +2005 paper

Top-10 precision: 0.6
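The best-performing keyword query in this sequence keeps the constants as required terms and adds the table name as an optional term. The sketch below reproduces only that final form for a simple single-table SELECT. The `translate` function and its regular expressions are assumptions for illustration; the slides do not show how the actual translation chooses its keywords.

```python
import re

def translate(sql):
    """Turn a simple single-table SELECT into a keyword query string."""
    table = re.search(r"\bFROM\s+(\w+)", sql, re.IGNORECASE).group(1)
    # Constants are the quoted literals; strip the LIKE wildcards around them.
    constants = [c.strip("%").lower() for c in re.findall(r"'([^']*)'", sql)]
    required = ["+" + c for c in constants if c]
    # Required constants first, then the table name as an optional keyword.
    return " ".join(required + [table.lower()])

sql = "SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'"
keyword_query = translate(sql)
print(keyword_query)
```

SQL keywords and attribute names are dropped, which matches the observation on the earlier slides that including them gives a top-10 precision of 0 or 0.2.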

Page 51: Managing a Space of Heterogeneous Data

II. Answering Queries that Combine Keywords and Structural Information

[Figure: structured and semi-structured data (RDB, XML, RDF) and unstructured data (Doc, webpages)]

Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword Search: “dataspaces”

Page 52: Managing a Space of Heterogeneous Data

II. Answering Queries that Combine Keywords and Structural Information

[Figure: structured and semi-structured data (RDB, XML, RDF) and unstructured data (Doc, webpages)]

Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword-based Structure-aware Queries: Article (title “dataspaces”) (year “2005”)

Keyword Search: “dataspaces”

Page 53: Managing a Space of Heterogeneous Data

Neighborhood Keyword Queries: Return Implicitly Relevant Instances in Answers to Keyword Queries

6 Messages, 67 Articles

Search ‘Schema Matching’

31 Persons working on Schema Matching (e.g., Jeff Naughton, Anhai Doan, Phil Bernstein, Renee Miller)

Page 54: Managing a Space of Heterogeneous Data

Predicate Queries: Queries that Combine Keywords and Simple Structural Requirements

Message (Sender “Halevy”) (Recipient “Luna”) (Subject “thesis”)

Page 55: Managing a Space of Heterogeneous Data

II. Answering Queries that Combine Keywords and Structural Information

[Figure: structured and semi-structured data (RDB, XML, RDF) and unstructured data (Doc, webpages)]

Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’

Keyword-based Structure-aware Queries: Article (title “dataspaces”) (year “2005”)

Keyword Search: “dataspaces”

Page 56: Managing a Space of Heterogeneous Data

Indexing Heterogeneous Data

Challenges
• Index data from heterogeneous data sources
• Capture both text values and structural information

Traditional Indexes
• Build a separate index for each attribute to support structured queries
• Build an inverted list to support keyword search
• XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)

Page 57: Managing a Space of Heterogeneous Data

Index Heterogeneous Data Using an Inverted List

[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]

Inverted List keywords: Alon, Dong, Halevy, Luna, Semex, Xin

Page 58: Managing a Space of Heterogeneous Data

Index Heterogeneous Data Using an Inverted List

Query: Dong

[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]

Inverted List:
Alon 1
Dong 1 1
Halevy 1
Luna 1
Semex 1
Xin 1

Page 59: Managing a Space of Heterogeneous Data

Incorporate Attribute Labels in the Inverted List

Query: firstName “Dong”

[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]

Inverted List:
Alon 1
Dong 1 1
Halevy 1
Luna 1
Semex 1
Xin 1

Page 60: Managing a Space of Heterogeneous Data

Incorporate Attribute Labels in the Inverted List

Query: firstName “Dong”, looked up as “Dong/firstName/”

[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]

Inverted List:
Alon/name/ 1
Dong/name/ 1
Dong/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/lastName/ 1

Page 61: Managing a Space of Heterogeneous Data

Incorporate Attribute Hierarchy in the Inverted List

Query: name “Dong”

[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]

Inverted List:
Alon/name/ 1
Dong/name/ 1
Dong/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/lastName/ 1

Page 62: Managing a Space of Heterogeneous Data

Incorporate Attribute Hierarchy in the Inverted List

Query: name “Dong”, looked up as “Dong/name/*”

Attribute hierarchy: name has sub-attributes firstName and lastName

[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]

Inverted List:
Alon/name/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/name/lastName/ 1
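The keys in this inverted list can be read as keyword, then the attribute path from the root of the hierarchy, so a query on a branch attribute becomes a prefix lookup. A minimal sketch under that reading; the instance identifiers, the `PARENT` table and the linear prefix scan are assumptions for illustration (a real index would presumably use sorted keys and a range scan).

```python
from collections import defaultdict

# Attribute hierarchy from the slide: firstName and lastName are sub-attributes of name.
PARENT = {"firstName": "name", "lastName": "name"}

def attr_path(attr):
    """Path from the root attribute down to attr, e.g. 'name/firstName/'."""
    path = [attr]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return "/".join(path) + "/"

def build_index(instances):
    index = defaultdict(set)
    for inst_id, attrs in instances.items():
        for attr, value in attrs.items():
            for word in value.split():
                index[f"{word}/{attr_path(attr)}"].add(inst_id)
    return index

def lookup(index, attr, keyword):
    """Predicate query (attr keyword): match every key with the prefix 'keyword/path'."""
    prefix = f"{keyword}/{attr_path(attr)}"
    hits = set()
    for key, ids in index.items():
        if key.startswith(prefix):
            hits |= ids
    return hits

# Hypothetical instance ids; attribute values follow the slide.
instances = {
    "desktop:alon": {"name": "Alon Halevy"},
    "desktop:luna": {"name": "Luna Dong"},
    "desktop:semex": {"title": "Semex"},
    "db:1000001": {"lastName": "Xin", "firstName": "Dong"},
}
index = build_index(instances)
print(sorted(index))
print(lookup(index, "name", "Dong"))
print(lookup(index, "firstName", "Dong"))
```

A query on name finds both the desktop contact and the database row, while a query on firstName finds only the database row.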

Page 63: Managing a Space of Heterogeneous Data

Incorporate Association Labels in the Inverted List

Query: author “Dong”

[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by author and authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]

Inverted List:
Alon/name/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/name/lastName/ 1

Page 64: Managing a Space of Heterogeneous Data

Incorporate Association Labels in the Inverted List

Query: author “Dong”, looked up as “Dong/author/*”

[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by author and authoredPaper associations; Departmental Database table (StuID, LastName, FirstName, …) with row (1000001, Xin, Dong, …)]

Inverted List:
Alon/author/ 1
Alon/name/ 1
Dong/author/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/authoredPaper/ 1 1
Semex/title/ 1
Xin/name/LastName/ 1

Page 65: Managing a Space of Heterogeneous Data

Answering Neighborhood Keyword Queries

Query: Semex, looked up as “Semex/*”

[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by author and authoredPaper associations; Departmental Database table (StuID, LastName, FirstName, …) with row (1000001, Xin, Dong, …)]

Inverted List:
Alon/author/ 1
Alon/name/ 1
Dong/author/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/authoredPaper/ 1 1
Semex/title/ 1
Xin/name/LastName/ 1
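Association labels can be indexed in the same way: for an association from a source instance to a target, the target's keywords are filed under the association label and point back to the source. A neighborhood keyword query is then a prefix lookup on the keyword alone. This is a sketch under that reading of the slides; the instance identifiers and helper names are hypothetical.

```python
from collections import defaultdict

# Desktop instances from the slide, with hypothetical ids.
instances = {
    "alon": {"name": "Alon Halevy"},
    "luna": {"name": "Luna Dong"},
    "semex": {"title": "Semex"},
}
# (source, label, target): the target's keywords are indexed under the label
# and point to the source instance.
associations = [
    ("semex", "author", "alon"),
    ("semex", "author", "luna"),
    ("alon", "authoredPaper", "semex"),
    ("luna", "authoredPaper", "semex"),
]

def keywords(inst_id):
    return [w for value in instances[inst_id].values() for w in value.split()]

def build_index():
    index = defaultdict(set)
    for inst_id, attrs in instances.items():
        for attr, value in attrs.items():
            for word in value.split():
                index[f"{word}/{attr}/"].add(inst_id)
    for source, label, target in associations:
        for word in keywords(target):
            index[f"{word}/{label}/"].add(source)
    return index

def prefix_lookup(index, prefix):
    """Union of the instance sets of all keys that start with the prefix."""
    return set().union(*(ids for key, ids in index.items() if key.startswith(prefix)))

index = build_index()
print(prefix_lookup(index, "Dong/author/"))  # predicate query: author "Dong"
print(prefix_lookup(index, "Semex/"))        # neighborhood keyword query: Semex
```

The neighborhood query returns the paper itself through its title and both authors through the authoredPaper entries, without any join at query time.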

Page 66: Managing a Space of Heterogeneous Data

Experimental Setting

Data sets
• A 50MB personal data set
• Two 10GB XML data sets: Wikipedia, XMark Benchmark

Queries: with one predicate or keyword
• Predicate Query with leaf attributes
• Predicate Query with branch attributes
• Predicate Query with associations
• Neighborhood Keyword Query

Measure: in milliseconds
• Index-lookup time
• Query-answering time

Page 67: Managing a Space of Heterogeneous Data

Our Indexing Method Significantly Improves Query Answering

Query Type | Plain Inverted List (10.6MB): Index Lookup (ms) | Query Answer (ms) | Extended Inverted List (28.1MB): Index Lookup (ms) | Query Answer (ms)
Pred Query with leaf attributes | 2 | 22 | 4 | 6
Pred Query with branch attributes | 3 | 43 | 4 | 6
Pred Query with associations | 3 | 88 | 6 | 17
Neighborhood Keyword Query | 18 | 4174 | 48 | 97

Page 68: Managing a Space of Heterogeneous Data

Our Indexing Method Scales Well

 | Wikipedia | XMark w/o asso | XMark with asso
Index | 4.15hr (1.13GB) | 6.64hr (3.04GB) | 12.72hr (4.08GB)
Pred Query with leaf attributes | 156 | 94 | 116
Pred Query with branch attributes | - | 67 | 93
Pred Query with associations | - | - | 217
Neighborhood Keyword Query | 1646 | 1838 | 13468

Page 69: Managing a Space of Heterogeneous Data

Heterogeneity at Different Levels

Name: First: Luna, Last: Dong. E-Mail Addresses: [email protected]

@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}

Instance level
• Reference reconciliation [Sigmod’05]

Query level
• Answering structured queries on unstructured data [WebDB’06]
• Indexing heterogeneous data [Sigmod’07]

Schema level
• Probabilistic schema mapping [VLDB’07]
• Visualization of heterogeneous data [InfoVis’07]

Page 70: Managing a Space of Heterogeneous Data

Probabilistic Schema Mapping

S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)

Possible Mapping | Probability
{(pname,name),(home-addr, mailing-addr)} | 0.5
{(pname,name),(office-addr, mailing-addr)} | 0.4
{(pname,name),(email-addr, mailing-addr)} | 0.1

Page 71: Managing a Space of Heterogeneous Data

By-Table vs. By-Tuple Semantics

Page 72: Managing a Space of Heterogeneous Data

By-Table vs. By-Tuple Semantics

Ds =
pname | email-addr | home-addr | office-addr
Alice | alice@ | Mountain View | Sunnyvale
Bob | bob@ | Sunnyvale | San Jose

DT = (target instances shown on the slide, over name and mailing-addr)
Probability 0.5: (Alice, Mountain View), (Bob, Sunnyvale)
Probability 0.4: (Alice, Sunnyvale), (Bob, San Jose)
Probability 0.1: (Alice, alice@), (Bob, bob@)

Page 73: Managing a Space of Heterogeneous Data

By-Table vs. By-Tuple Semantics

Ds =
pname | email-addr | home-addr | office-addr
Alice | alice@ | Mountain View | Sunnyvale
Bob | bob@ | Sunnyvale | San Jose

DT = (target instances shown on the slide, over name and mailing-addr)
Probability 0.2: (Alice, Mountain View), (Bob, San Jose)
Probability 0.16: (Alice, Sunnyvale), (Bob, San Jose)
Probability 0.04: (Alice, Sunnyvale), (Bob, bob@)
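The probabilities on these two slides are consistent with the following reading: under by-table semantics one mapping is chosen for the whole source table (0.5, 0.4, 0.1), while under by-tuple semantics each tuple chooses its mapping independently (for example 0.5 × 0.4 = 0.2). The sketch below computes answer probabilities for projecting mailing-addr under both readings by brute-force enumeration. The function names are hypothetical and the enumeration is exponential in the number of tuples, so it only illustrates the definitions.

```python
from itertools import product

# Source table Ds from the slide (schema S).
SOURCE = [
    {"pname": "Alice", "email-addr": "alice@", "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@", "home-addr": "Sunnyvale", "office-addr": "San Jose"},
]

# Possible mappings from target attributes to source attributes, with probabilities.
P_MAPPING = [
    ({"name": "pname", "mailing-addr": "home-addr"}, 0.5),
    ({"name": "pname", "mailing-addr": "office-addr"}, 0.4),
    ({"name": "pname", "mailing-addr": "email-addr"}, 0.1),
]

def by_table(attr):
    """One mapping applies to every tuple; add up the mappings that yield each value."""
    answers = {}
    for mapping, prob in P_MAPPING:
        for value in {row[mapping[attr]] for row in SOURCE}:
            answers[value] = answers.get(value, 0.0) + prob
    return answers

def by_tuple(attr):
    """Each tuple picks a mapping independently; enumerate every combination."""
    answers = {}
    for choice in product(P_MAPPING, repeat=len(SOURCE)):
        prob = 1.0
        values = set()
        for row, (mapping, p) in zip(SOURCE, choice):
            prob *= p
            values.add(row[mapping[attr]])
        for value in values:
            answers[value] = answers.get(value, 0.0) + prob
    return answers

table_answers = by_table("mailing-addr")
tuple_answers = by_tuple("mailing-addr")
print(table_answers["Sunnyvale"], tuple_answers["Sunnyvale"])
```

Sunnyvale gets 0.5 + 0.4 = 0.9 by-table, but 1 − (1 − 0.4)(1 − 0.5) = 0.7 by-tuple, so the two semantics can give different answers to the same query.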

Page 74: Managing a Space of Heterogeneous Data

Theoretical Results

• Query answering in by-table semantics
  • In PTIME in the size of the data
• Query answering in by-tuple semantics
  • In general #P-complete in the size of the data
  • In PTIME for two types of queries:
    • The query contains a single table that is a target in a probabilistic mapping
    • If a join attribute is in a table that is a target in a probabilistic mapping, the query returns the attribute

Page 75: Managing a Space of Heterogeneous Data

More Theoretical Results

• Query answering in both semantics is in PTIME in the size of the probabilistic mapping
• Compressed representations of probabilistic mappings
  • We propose two compact representations of probabilistic mappings, such that query answering is still in PTIME in the size of the mapping
  • When we encode probabilistic mappings using a Bayes Net, query answering can be exponential in the size of the mapping

Page 76: Managing a Space of Heterogeneous Data

Conclusions

Goal: Provide quality search, querying and browsing for dataspaces

Thesis Contributions
• An algorithm for reference reconciliation
• An indexing method for supporting queries that combine keywords and structure
• An algorithm for answering structured queries on unstructured data
• The concept and theoretical foundation for Probabilistic Schema Mapping
• An approach for visualizing heterogeneous data
• A PIM system incorporating the above

Page 77: Managing a Space of Heterogeneous Data

Future Work I. Evolve Semantic Relationships Between Data Sources on an As-needed Basis

D1

D2

D3

D4

D5

Mediated Schema

Q

Page 78: Managing a Space of Heterogeneous Data

D1

D2

D3

D4

D5

Future Work II. Manage Dataspaces at the Web-Scale

Page 79: Managing a Space of Heterogeneous Data

Future Work II. Manage Dataspaces at the Web-Scale

Challenges: Large scale and complex domains

Future directions:
1. Probabilistic data integration
2. Information redundancy
3. Universal search

Keyword Search

Page 80: Managing a Space of Heterogeneous Data

Research Methodology

[Figure: areas Machine Learning, Information Retrieval, Database]

System
1. Semex Personal Information Management System [Sigmod’05 Best Demo]
2. Woogle Web Service Search Engine [VLDB’04]

Theory
1. Probabilistic Schema Mapping [VLDB’07]
2. XML Query Containment [VLDB’04]
3. Optimization of Query Difference (Submitted)

Page 81: Managing a Space of Heterogeneous Data

Acknowledgement

[Figure: association graph around Project: Semex. Nodes: Person: Luna, Person: Alon, Person: Jayant, Person: Michelle, Person: Yuhan, CIDR, Stanford Visual Grp. Edge labels: advisor, co-worker, collaborator, participant, projectLeader, ArticleAbout, publishedIn]

Page 82: Managing a Space of Heterogeneous Data
Page 83: Managing a Space of Heterogeneous Data

Our Algorithm Equals or Outperforms Attr-wise Matching in All Classes

Class | Attr-wise Matching: Precision | Recall | Association Network: Precision | Recall
Person | 0.872 | 0.733 | 0.924 | 0.875
Article | 0.997 | 0.977 | 0.999 | 0.976
Venue | 0.935 | 0.790 | 0.987 | 0.937

Page 84: Managing a Space of Heterogeneous Data

Results on the Cora Dataset Are Competitive with Other Reported Results

Results reported in other record linkage papers:
• Precision/Recall = 0.990/0.925 [Cohen et al., 2002]
• Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]
• F-measure = 0.867 [Bilenko and Mooney, 2003]

Class | Attr-wise Matching: Prec/Recall | F-msre | Dependency Graph: Prec/Recall | F-msre
Article | 0.985/0.913 | 0.948 | 0.985/0.924 | 0.954
Person | 0.994/0.985 | 0.989 | 1/0.987 | 0.993
Venue | 0.982/0.362 | 0.529 | 0.837/0.714 | 0.771

Page 85: Managing a Space of Heterogeneous Data

Experiment Settings

Measure: Diversity and Dispersion
• Diversity: For every result partition, how many real-world objects are included; ideally should be 1 (related to precision)
• Dispersion: For every real-world object, how many result partitions include them; ideally should be 1 (related to recall)
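Both measures can be computed from a result partitioning and a gold-standard labelling of the references. The sketch below averages over partitions and over objects; averaging is an assumption, suggested by the single value per data set reported on the next slide, and the data is made up for illustration.

```python
def diversity(result, truth):
    """Average number of distinct real-world objects per result partition."""
    return sum(len({truth[ref] for ref in part}) for part in result) / len(result)

def dispersion(result, truth):
    """Average number of result partitions that each real-world object is spread over."""
    spread = {}
    for i, part in enumerate(result):
        for ref in part:
            spread.setdefault(truth[ref], set()).add(i)
    return sum(len(parts) for parts in spread.values()) / len(spread)

# Illustrative data: five references to three real-world objects A, B, C.
truth = {"r1": "A", "r2": "A", "r3": "B", "r4": "B", "r5": "C"}
result = [{"r1", "r2", "r3"}, {"r4"}, {"r5"}]

print(diversity(result, truth), dispersion(result, truth))
```

The first partition mixes objects A and B, and object B is split over two partitions, so both measures come out at 4/3 rather than the ideal 1.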

Page 86: Managing a Space of Heterogeneous Data

Diversity and Dispersion Are Very Close to 1

Dataset (#per/#ref) | Attr-wise Matching: Diversity/Dispersion | Dependency Graph: Diversity/Dispersion
A (1750/24076) | 1.18/1.003 | 1.047/1.003
B (1989/36359) | 1.067/1.01 | 1.039/1.008
C (1570/15160) | 1.053/1.003 | 1.03/1.017
D (1518/17199) | 1.041/1.004 | 1.023/1.005
Avg | 1.085/1.005 | 1.035/1.008


Page 88: Managing a Space of Heterogeneous Data

I. Visualizing Heterogeneous Data

Current data visualization
• Consider only data residing in a single database
• Allow users to specify a visualization for each type of data (e.g., Haystack [Karger et al., 2005])

Visualization of dataspaces needs to consider data from heterogeneous sources

Page 89: Managing a Space of Heterogeneous Data

Example Visualization: A Map Marked with Papers

Page 90: Managing a Space of Heterogeneous Data

Example Visualization: A Calendar with Presentation Slides