Managing a Space of Heterogeneous Data
Xin (Luna) Dong, University of Washington
March 2007
Once upon a time…
Nowadays…
[Figure: heterogeneous data sources D1, D2, D3, D4, D5]
Mappings Between Heterogeneous Data Sources
MovieDVD:
Name            Length      Status      Price     Rate
The Departed    151 mins    In stock    $34.99    Excellent
…               …           …           …         …
Movie:
ID       Title           Year    Genre    Runtime    Director
15827    The Departed    2006    Crime    151 min    32468
Director:
DirectorID    Name
32468         Martin Scorsese
Review:
MovieID    Review
15827      Martin Scorsese Hits the Streets Again!
Traditional Data Integration Systems Require Semantic Mappings Between Data Sources Up Front
[Figure: a query Q over the Mediated Schema is reformulated into queries Q1 to Q5 over data sources D1 to D5]
In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1 to D5 with unknown mappings between them (?)]
Scenario 1. Different Websites About Movies
Intranet, Internet
Scenario 2. Personal Information Space
In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: a query Q over a Mediated Schema on top of data sources D1 to D5]
Managing Dataspaces
Dataspaces [Halevy et al., PODS’06]
• Collections of heterogeneous data sources
• Do not necessarily include semantic mappings
• Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
My goal: Provide quality search, querying and browsing as the system evolves
Heterogeneity at Different Levels
Name: First: Luna, Last: Dong; E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Heterogeneity at Instance Level
Name: First: Luna, Last: Dong; E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Form of heterogeneity
• The same real-world object can be referred to using different attribute values
Current work
• Record linkage: most work assumes matching tuples from a single database table that has a fair number of attributes (surveyed in [Winkler, 2006])
Contributions
• Reference reconciliation: reconcile instances of multiple classes and with only limited attributes [Sigmod’05]
Heterogeneity at Schema Level
Name: First: Luna, Last: Dong; E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Form of heterogeneity
• The same domain can be described using different schemas
• Data can be (semi-)structured or unstructured
Current work
• Schema matching (surveyed in [Rahm&Bernstein, 2001])
• Query reformulation (surveyed in [Halevy 2000])
Contributions
• Probabilistic schema mapping [VLDB’07]
• Visualizing heterogeneous data [InfoVis’07]
Heterogeneity at Query Level
Name: First: Luna, Last: Dong; E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Form of heterogeneity
• Different terms and different levels of structural details
• Keyword search: ‘Semex Dong’
• Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)
Current work
• Keyword search on databases (Discover, DBExplorer, etc.)
Contributions
• Seamless querying of structured and unstructured data
• Indexing heterogeneous data [Sigmod’07]
• Answering structured queries on unstructured data [WebDB’06]
Outline
• Problem definition and goals
• Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
• Technical contributions:
  • Reference reconciliation [Sigmod 2005]
  • Indexing heterogeneous data [Sigmod 2007]
  • Answering structured queries on unstructured data [WebDB 2006]
  • Probabilistic schema mapping [VLDB 2007]
  • Visualizing heterogeneous data [InfoVis 2007]
• Future research directions
[Figure: association network over personal data, with associations OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom]
Semex Generates a Logical View of Meaningful Objects and Associations
Semex Provides Association Browsing of One’s Personal Information
[Screenshots: browsing a person (Names, Emails; “Alon. Y. Levy”), the paper “A Platform for Personal Information Management and Integration” (Title, Year), the venue CIDR, and the paper “Trio: A System for Integrated Management of Data, Accuracy, and Lineage”]
Question 1: Which emails has my advisor sent me about my thesis?
[email protected]@[email protected]@transformic.com
Question 2: Who have been working on schema matching?
Search ‘Schema Matching’
6 Messages, 67 Articles
31 Persons working on Schema Matching (e.g., Alon Halevy, Phil Bernstein, Renee Miller, Anhai Doan)
Question 3: Which of my friends published in Sigmod 2007?
My friends who published papers in Sigmod 2007
Semex Architecture
[Figure: architecture diagram]
• Data sources: Word, PPT, PDF, Latex, Email, Webpage, Excel, DB
• Data Integration Module: Extractors, Integrator, Reference Reconciliater, Indexer
• Schema Management Module: Domain Model, Domain Manager
• Stores: Association DB (Objects, Associations), Index
• Data Analysis Module: Searcher, Browser, Analyzer
Outline
• Problem definition and our principle
• Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
• Technical contributions:
  • Reference reconciliation [Sigmod 2005]
  • Indexing heterogeneous data [Sigmod 2007]
  • Answering structured queries on unstructured data [WebDB 2006]
  • Probabilistic schema mapping [VLDB 2007]
  • Visualizing heterogeneous data [InfoVis 2007]
• Future research directions
Heterogeneity at Different Levels
Name: First: Luna, Last: Dong; E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Instance level
• Reference reconciliation [Sigmod’05]
Query level
• Answering structured queries on unstructured data [WebDB’06]
• Indexing heterogeneous data [Sigmod’07]
Schema level
• Probabilistic schema mapping [VLDB’07]
• Visualization of heterogeneous data [InfoVis’07]
Reference Reconciliation is Crucial in Dataspaces
[Figure: many references to the same person, drawn from Names and Emails: Xin (Luna) Dong, xin dong, xinluna dong, luna, dongxin, x. dong, Lab-#dong xin, dong xin luna]
Previous Approaches
• A very active area of research in databases, data mining and AI
• Most current approaches assume matching tuples from a single database table
  • Traditional approaches are based on pair-wise comparisons (surveyed in [Winkler, 2006])
  • New approaches explore relationships between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004]
• Harder for a complex information space
Challenges for a Complex Information Space
Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)
Person:
p1=(“Robert S. Epstein”, null), p2=(“Michael Stonebraker”, null), p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null), p5=(“Stonebraker, M.”, null), p6=(“Wong, E.”, null)
Challenges for a Complex Information Space
Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)
Person:
p1=(“Robert S. Epstein”, null), p2=(“Michael Stonebraker”, null), p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null), p5=(“Stonebraker, M.”, null), p6=(“Wong, E.”, null)
p7=(“Eugene Wong”, “[email protected]”), p8=(null, “[email protected]”), p9=(“mike”, “[email protected]”)
Challenges:
1. Multiple Classes
2. Limited Information
3. Multi-value Attributes
Intuition: Exploit Association Network
We extract from dataspaces networks of instances and associations between the instances
Key: exploit the network, specifically, the clues hidden in the associations
Strategy I. Exploiting Richer Evidence
• Cross-attribute similarity (name & email): p5=(“Stonebraker, M.”, null), p8=(null, “[email protected]”)
• Context Information I (contact list): p5=(“Stonebraker, M.”, null, {p4, p6}), p8=(null, “[email protected]”, {p7}), p6=p7
• Context Information II (authored articles): p2=(“Michael Stonebraker”, null), p5=(“Stonebraker, M.”, null); p2 and p5 authored the same article
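The two kinds of evidence named on this slide can be illustrated with a minimal Python sketch. It is not the algorithm from the Sigmod’05 paper: the class `PersonRef`, the helper names, the substring test and the address `stonebraker@example.org` (standing in for the elided address on the slide) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class PersonRef:
    ref_id: str
    name: Optional[str] = None
    email: Optional[str] = None
    contacts: FrozenSet[str] = frozenset()


def name_tokens(name: str) -> set:
    """Lower-cased name parts longer than one letter (drops initials such as 'M.')."""
    parts = (part.strip(".,").lower() for part in name.split())
    return {part for part in parts if len(part) > 1}


def name_email_evidence(a: PersonRef, b: PersonRef) -> bool:
    """Cross-attribute evidence: a name token of one reference appears in the other's email account."""
    for named, mailed in ((a, b), (b, a)):
        if named.name and mailed.email:
            account = mailed.email.split("@")[0].lower()
            if any(token in account for token in name_tokens(named.name)):
                return True
    return False


def shared_contact_evidence(a: PersonRef, b: PersonRef, reconciled: set) -> bool:
    """Context evidence: some contact of a is already reconciled with some contact of b."""
    return any(
        c == d or frozenset((c, d)) in reconciled
        for c in a.contacts
        for d in b.contacts
    )


# Hypothetical address standing in for the elided one on the slide.
p5 = PersonRef("p5", "Stonebraker, M.", None, frozenset({"p4", "p6"}))
p8 = PersonRef("p8", None, "stonebraker@example.org", frozenset({"p7"}))
reconciled = {frozenset(("p6", "p7"))}  # the slide's p6 = p7

print(name_email_evidence(p5, p8))
print(shared_contact_evidence(p5, p8, reconciled))
```

Attribute-wise comparison alone finds nothing to compare between p5 (name only) and p8 (email only); both checks above supply evidence that such a comparison misses.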
Considering Only Attribute-wise Similarities Cannot Merge Persons Well
Person references: 24076; real-world persons (gold standard): 1750
[Chart: #(Person Partitions) by evidence; attribute-wise matching yields 3159 partitions; chart annotation: 1409]
Considering Richer Evidence Improves the Result
Person references: 24076; real-world persons: 1750
[Chart: #(Person Partitions) by evidence]
Evidence      #(Person Partitions)
Attr-wise     3159
Name&Email    2169
Article       2169
Contact       2096
Chart annotations: 1409, 346
Strategy II. Propagate Information Between Reconciliation Decisions
Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)
Person:
p1=(“Robert S. Epstein”, null), p2=(“Michael Stonebraker”, null), p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null), p5=(“Stonebraker, M.”, null), p6=(“Wong, E.”, null)
Propagating Information Between Reconciliation Decisions Further Improves the Result
Person references: 24076; real-world persons: 1750
[Chart: #(Person Partitions) by evidence]
Evidence      Traditional    Propagation
Attr-wise     3159           3159
Name&Email    2169           2146
Article       2169           2135
Contact       2096           2022
Chart annotations: 1409, 346, 272
Strategy III. Reference Enrichment
p2=(“Michael Stonebraker”, null, {p1,p3})
p8=(null, “[email protected]”, {p7})
p9=(“mike”, “[email protected]”, null)
p8-9=(“mike”, “[email protected]”, {p7})
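The enrichment step on this slide (merging p8 and p9 into p8-9) can be sketched as follows. This is a hedged illustration, not the paper's implementation: the dictionary-of-sets representation, the helper names and the address `stonebraker@example.org` (a placeholder for the elided address) are assumptions.

```python
def make_ref(ref_id, name=None, email=None, contacts=()):
    """Represent a reference as sets of attribute values so merged references can hold several."""
    return {
        "ids": {ref_id},
        "names": {name} if name else set(),
        "emails": {email} if email else set(),
        "contacts": set(contacts),
    }


def enrich(a, b):
    """Merge two references judged to be the same person by pooling their values."""
    return {key: a[key] | b[key] for key in a}


def share_email(a, b):
    """Simple merge test used here: the two references have an email address in common."""
    return bool(a["emails"] & b["emails"])


# Hypothetical address standing in for the elided one on the slide.
p8 = make_ref("p8", None, "stonebraker@example.org", {"p7"})
p9 = make_ref("p9", "mike", "stonebraker@example.org")

if share_email(p8, p9):
    p8_9 = enrich(p8, p9)
    print(p8_9)
```

The merged reference carries a name, an email and a contact list at once, so later comparisons (for example against p2) have more attribute values to work with than either p8 or p9 offered alone.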
Reference Enrichment Improves the Result More than Information Propagation
Person references: 24076; real-world persons: 1750
[Chart: #(Person Partitions) by evidence; legend: Traditional, Enrichment, Propagation; labeled values shown for two series]
Evidence      Traditional    Enrichment
Attr-wise     3159           3169
Name&Email    2169           2036
Article       2169           2036
Contact       2096           1910
Chart annotations: 1409, 346, 160
Applying Both Information Propagation and Reference Enrichment Gets the Best Result
Person references: 24076; real-world persons: 1750
[Chart: #(Person Partitions) by evidence; legend: Traditional, Enrichment, Propagation, Full; labeled values shown for two series]
Evidence      Traditional    Full
Attr-wise     3159           3169
Name&Email    2169           2002
Article       2169           1990
Contact       2096           1873
Chart annotations: 1409, 346, 125
Experiment Settings
• Data sets: four personal data sets
• Use the same parameters and thresholds for all data sets
• Measures:
  • Precision = #(correctly reconciled reference pairs) / #(reconciled reference pairs)
  • Recall = #(correctly reconciled reference pairs) / #(reference pairs that refer to the same real-world object)
  • F-measure = (2 · Precision · Recall) / (Precision + Recall)
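The pairwise measures above can be computed directly from a result partitioning and a gold-standard partitioning. The following Python sketch assumes both are given as lists of sets of reference ids; the function names and the toy data are illustrative only.

```python
from itertools import combinations


def same_pairs(partitions):
    """All unordered pairs of references that are placed in the same partition."""
    pairs = set()
    for cluster in partitions:
        pairs.update(frozenset(pair) for pair in combinations(sorted(cluster), 2))
    return pairs


def pairwise_scores(result, gold):
    """Pairwise precision, recall and F-measure of a result against a gold standard."""
    predicted, truth = same_pairs(result), same_pairs(gold)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(truth) if truth else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example: p3 is wrongly merged with p2/p5 and its true match p6 is left alone.
gold = [{"p1", "p4"}, {"p2", "p5"}, {"p3", "p6"}]
result = [{"p1", "p4"}, {"p2", "p5", "p3"}, {"p6"}]
precision, recall, f_measure = pairwise_scores(result, gold)
print(precision, recall, f_measure)
```

In the toy example 2 of the 4 reconciled pairs are correct (precision 1/2) and 2 of the 3 true pairs are found (recall 2/3), giving an F-measure of 4/7.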
Precision and Recall Increase Largely Compared with Attr-wise Matching
Dataset    Attr-wise Matching                Association Network
           Precision    Recall    F          Precision    Recall    F
A          0.995        0.509     0.673      0.982        0.947     0.964
B          0.81         0.803     0.806      0.958        0.891     0.923
C          0.987        0.782     0.873      0.814        0.925     0.867
D          0.694        0.837     0.759      0.942        0.737     0.827
Avg        0.872        0.733     0.778      0.924        0.875     0.895
Seamless Querying of Structured and Unstructured Data
[Figure: structured and semi-structured data (RDB, XML, RDF) and unstructured data (Doc, webpages)]
Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword Search: “dataspaces”
I. Answering Structured Queries on Unstructured Data
[Figure: structured and semi-structured data (RDB, XML, RDF) and unstructured data (Doc, webpages), with query/data combinations labeled DB, DB, IR and ?; the open case (?) is structured queries on unstructured data]
Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword Search: “dataspaces”
Our approach: query translation
• Transform a structured query into keyword search
• Keyword search on unstructured data
Challenges
Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Candidate keyword queries and their Top-10 Precision:
Keyword query                                                           Top-10 Precision
select title from paper where title LIKE +dataspaces and year +2005    0
title paper title +dataspaces year +2005                                0
+dataspaces +2005                                                       0.2
+dataspaces +2005 paper title                                           0.2
+dataspaces +2005 paper                                                 0.6
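As a rough illustration of query translation, the sketch below produces the last keyword query on the slide: constants from the WHERE clause become required keywords (prefixed with +) and the table name becomes an optional keyword. The regular expressions and the function `to_keyword_query` are assumptions made for this example and handle only simple single-table queries; they are not the translation algorithm of the WebDB’06 paper.

```python
import re


def to_keyword_query(sql: str) -> str:
    """Translate a simple single-table SQL query into a keyword query."""
    # Quoted constants, with LIKE wildcards (%) stripped, become required keywords.
    constants = re.findall(r"'%?([^'%]+)%?'", sql)
    required = [f"+{value.lower()}" for value in constants]
    # The table name is kept as an optional keyword describing the kind of object.
    table = re.search(r"\bFROM\s+(\w+)", sql, flags=re.IGNORECASE)
    optional = [table.group(1).lower()] if table else []
    return " ".join(required + optional)


sql = "SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'"
print(to_keyword_query(sql))
```

Dropping SQL syntax and attribute names while keeping the table name mirrors the progression on the slide, where the same choice gave the highest top-10 precision among the five candidates.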
II. Answering Queries that Combine Keywords and Structural Information
[Figure: structured and semi-structured data (RDB, XML, RDF) and unstructured data (Doc, webpages)]
Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword-based Structure-aware Queries: Article (title “dataspaces”) (year “2005”)
Keyword Search: “dataspaces”
Neighborhood Keyword Queries: Return Implicitly Relevant Instances in Answers to Keyword Queries
Search ‘Schema Matching’
6 Messages, 67 Articles
31 Persons working on Schema Matching (e.g., Jeff Naughton, Anhai Doan, Phil Bernstein, Renee Miller)
Predicate Queries: Queries that Combine Keywords and Simple Structural Requirements
Message (Sender “Halevy”) (Recipient “Luna”) (Subject “thesis”)
Indexing Heterogeneous Data
Challenges
• Index data from heterogeneous data sources
• Capture both text values and structural information
Traditional Indexes
• Build a separate index for each attribute to support structured queries
• Build an inverted list to support keyword search
• XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)
Index Heterogeneous Data Using an Inverted List
[Figure: Desktop instances “Alon Halevy” and “Luna Dong”, each linked to the paper “Semex: …” by author / authoredPaper associations; Departmental Database table (StuID, lastName, firstName, …) with row (1000001, Xin, Dong, …)]
Inverted List keywords: Alon, Dong, Halevy, Luna, Semex, Xin
Index Heterogeneous Data Using an Inverted List
[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …” with author / authoredPaper associations; Departmental Database row (1000001, Xin, Dong, …)]
Query: Dong
Inverted List (each 1 marks an instance containing the keyword):
Alon 1
Dong 1 1
Halevy 1
Luna 1
Semex 1
Xin 1
Incorporate Attribute Labels in the Inverted List
[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …” with author / authoredPaper associations; Departmental Database row (1000001, Xin, Dong, …)]
Query: firstName “Dong”
Inverted List:
Alon 1
Dong 1 1
Halevy 1
Luna 1
Semex 1
Xin 1
Incorporate Attribute Labels in the Inverted List
[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …” with author / authoredPaper associations; Departmental Database row (1000001, Xin, Dong, …)]
Query: firstName “Dong” is answered by looking up “Dong/firstName/”
Inverted List:
Alon/name/ 1
Dong/name/ 1
Dong/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/lastName/ 1
Incorporate Attribute Hierarchy in the Inverted List
[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …” with author / authoredPaper associations; Departmental Database row (1000001, Xin, Dong, …)]
Query: name “Dong”
Inverted List:
Alon/name/ 1
Dong/name/ 1
Dong/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/lastName/ 1
Incorporate Attribute Hierarchy in the Inverted List
[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …” with author / authoredPaper associations; Departmental Database row (1000001, Xin, Dong, …)]
Attribute hierarchy: name has sub-attributes firstName and lastName
Query: name “Dong” is answered by looking up “Dong/name/*”
Inverted List:
Alon/name/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/name/lastName/ 1
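The attribute-label and attribute-hierarchy ideas on these slides can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the index from the Sigmod’07 paper: the class `ExtendedInvertedList`, the instance ids (`d1`, `d2`, `d3`, `s1`), lower-casing of keywords and the linear prefix scan are choices made for brevity.

```python
from collections import defaultdict

# Attribute hierarchy from the slide: firstName and lastName are sub-attributes of name.
PARENT = {"firstName": "name", "lastName": "name"}


def attribute_path(attr: str) -> str:
    """Path from the top-level attribute down to attr, e.g. 'name/firstName/'."""
    path = [attr]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return "/".join(path) + "/"


class ExtendedInvertedList:
    def __init__(self):
        self.postings = defaultdict(set)  # key 'keyword/attribute path/' -> instance ids

    def add(self, instance_id: str, attr: str, text: str) -> None:
        for word in text.lower().split():
            self.postings[word + "/" + attribute_path(attr)].add(instance_id)

    def lookup(self, attr: str, keyword: str) -> set:
        """Answer a predicate query: attr (and its sub-attributes) contains keyword."""
        prefix = keyword.lower() + "/" + attribute_path(attr)
        hits = set()
        for key, ids in self.postings.items():  # a real index would use a sorted prefix scan
            if key.startswith(prefix):
                hits |= ids
        return hits


index = ExtendedInvertedList()
index.add("d1", "name", "Luna Dong")    # Desktop person
index.add("d2", "name", "Alon Halevy")  # Desktop person
index.add("d3", "title", "Semex")       # Desktop paper
index.add("s1", "lastName", "Xin")      # Departmental Database row, as on the slide
index.add("s1", "firstName", "Dong")

print(index.lookup("firstName", "Dong"))
print(index.lookup("name", "Dong"))
```

Because the hierarchy is encoded in the key itself, the query name “Dong” becomes a prefix lookup on `dong/name/` and picks up both the Desktop person and the database row, while firstName “Dong” matches only the row.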
Incorporate Association Labels in the Inverted List
[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …” with author / authoredPaper associations; Departmental Database row (1000001, Xin, Dong, …)]
Query: author “Dong”
Inverted List:
Alon/name/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/name/lastName/ 1
Incorporate Association Labels in the Inverted List
[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …” with author / authoredPaper associations; Departmental Database table (StuID, LastName, FirstName, …) with row (1000001, Xin, Dong, …)]
Query: author “Dong” is answered by looking up “Dong/author/*”
Inverted List:
Alon/author/ 1
Alon/name/ 1
Dong/author/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/authoredPaper/ 1 1
Semex/title/ 1
Xin/name/LastName/ 1
Answering Neighborhood Keyword Queries
[Figure: Desktop instances “Alon Halevy”, “Luna Dong” and “Semex: …” with author / authoredPaper associations; Departmental Database table (StuID, LastName, FirstName, …) with row (1000001, Xin, Dong, …)]
Query: Semex is answered by looking up “Semex/*”
Inverted List:
Alon/author/ 1
Alon/name/ 1
Dong/author/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/authoredPaper/ 1 1
Semex/title/ 1
Xin/name/LastName/ 1
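Association labels and neighborhood keyword queries can be sketched the same way. The following Python example is a simplified illustration, not the paper's index: the function names, the instance ids (`luna`, `alon`, `semex`) and the lower-cased keys are assumptions. An instance is indexed under the keywords of the instances it is associated with, so one prefix lookup on a keyword returns both the matching instances and their neighbors.

```python
from collections import defaultdict

postings = defaultdict(set)  # key 'keyword/label/' -> instance ids


def index_attribute(instance: str, attr: str, text: str) -> None:
    """Index an instance under the words of one of its own attribute values."""
    for word in text.lower().split():
        postings[f"{word}/{attr}/"].add(instance)


def index_association(instance: str, label: str, neighbor_text: str) -> None:
    """Index an instance under the words of an instance it is associated with."""
    for word in neighbor_text.lower().split():
        postings[f"{word}/{label}/"].add(instance)


def prefix_lookup(prefix: str) -> set:
    """Union of all posting lists whose key starts with the prefix."""
    hits = set()
    for key, ids in postings.items():
        if key.startswith(prefix):
            hits |= ids
    return hits


# Desktop instances from the slide.
index_attribute("luna", "name", "Luna Dong")
index_attribute("alon", "name", "Alon Halevy")
index_attribute("semex", "title", "Semex")
# The paper has two authors; each author has the paper as authoredPaper.
index_association("semex", "author", "Luna Dong")
index_association("semex", "author", "Alon Halevy")
index_association("luna", "authoredPaper", "Semex")
index_association("alon", "authoredPaper", "Semex")

print(prefix_lookup("dong/author/"))  # predicate query: author "Dong"
print(prefix_lookup("semex/"))        # neighborhood keyword query: Semex
```

The neighborhood query returns the paper itself (through `semex/title/`) and both authors (through `semex/authoredPaper/`), which matches the two marks on the Semex/authoredPaper/ row of the slide.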
Experimental Setting
• Data sets: a 50MB personal data set; two 10GB XML data sets (Wikipedia, XMark Benchmark)
• Queries (each with one predicate or keyword):
  • Predicate Query with leaf attributes
  • Predicate Query with branch attributes
  • Predicate Query with associations
  • Neighborhood Keyword Query
• Measures (in milliseconds): index-lookup time, query-answering time
Our Indexing Method Significantly Improves Query Answering
Query Type                           Plain Inverted List (10.6MB)             Extended Inverted List (28.1MB)
                                     Index Lookup (ms)   Query Answer (ms)    Index Lookup (ms)   Query Answer (ms)
Pred Query with leaf attributes      2                   22                   4                   6
Pred Query with branch attributes    3                   43                   4                   6
Pred Query with associations         3                   88                   6                   17
Neighborhood Keyword Query           18                  4174                 48                  97
Our Indexing Method Scales Well
                                     Wikipedia          XMark w/o asso     XMark with asso
Index                                4.15hr (1.13GB)    6.64hr (3.04GB)    12.72hr (4.08GB)
Pred Query with leaf attributes      156                94                 116
Pred Query with branch attributes    -                  67                 93
Pred Query with associations         -                  -                  217
Neighborhood Keyword Query           1646               1838               13468
Probabilistic Schema Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
Possible Mapping                                 Probability
{(pname,name),(home-addr, mailing-addr)}         0.5
{(pname,name),(office-addr, mailing-addr)}       0.4
{(pname,name),(email-addr, mailing-addr)}        0.1
By-Table vs. By-Tuple Semantics
Ds=
pname    email-addr    home-addr        office-addr
Alice    alice@        Mountain View    Sunnyvale
Bob      bob@          Sunnyvale        San Jose
Possible target instances DT (name, mailing-addr) when one mapping applies to the whole table:
0.5: (Alice, Mountain View), (Bob, Sunnyvale)
0.4: (Alice, Sunnyvale), (Bob, San Jose)
0.1: (Alice, alice@), (Bob, bob@)
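By-Table vs. By-Tuple Semantics
Ds=
pname    email-addr    home-addr        office-addr
Alice    alice@        Mountain View    Sunnyvale
Bob      bob@          Sunnyvale        San Jose
Possible target instances DT (name, mailing-addr) when each tuple can follow a different mapping (the probability is the product of the per-tuple mapping probabilities, e.g. 0.2 = 0.5 × 0.4):
0.2: (Alice, Mountain View), (Bob, San Jose)
0.16: (Alice, Sunnyvale), (Bob, San Jose)
0.04: (Alice, Sunnyvale), (Bob, bob@)
…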
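To make the difference between the two semantics concrete, the sketch below enumerates the possible mappings for the example tables and computes, for an assumed query that returns every mailing-addr in T, the probability of each answer. The query, the function names and the brute-force enumeration are illustrative assumptions; an answer's probability is taken to be the total probability of the mappings (by-table) or mapping sequences (by-tuple) under which it is returned.

```python
from collections import defaultdict
from itertools import product

SOURCE = [
    {"pname": "Alice", "email-addr": "alice@", "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@", "home-addr": "Sunnyvale", "office-addr": "San Jose"},
]
# Each possible mapping is identified by the source attribute mapped to mailing-addr.
MAPPINGS = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]


def by_table():
    """One mapping is applied to every tuple of the source table."""
    probs = defaultdict(float)
    for attr, p in MAPPINGS:
        for value in {row[attr] for row in SOURCE}:
            probs[value] += p
    return dict(probs)


def by_tuple():
    """Each source tuple independently follows one of the possible mappings."""
    probs = defaultdict(float)
    for sequence in product(MAPPINGS, repeat=len(SOURCE)):
        p = 1.0
        answers = set()
        for row, (attr, q) in zip(SOURCE, sequence):
            p *= q
            answers.add(row[attr])
        for value in answers:
            probs[value] += p
    return dict(probs)


table_answers = by_table()
tuple_answers = by_tuple()
print(table_answers)
print(tuple_answers)
```

Under by-table semantics Sunnyvale is returned by the home-addr and office-addr mappings, so its probability is 0.5 + 0.4 = 0.9; under by-tuple semantics it is returned unless Alice does not use office-addr and Bob does not use home-addr, giving 1 - 0.6 × 0.5 = 0.7. The enumeration over mapping sequences grows exponentially with the number of tuples, which is why it is only a didactic sketch.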
Theoretical Results
• Query answering in by-table semantics: in PTIME in the size of the data
• Query answering in by-tuple semantics:
  • In general #P-complete in the size of the data
  • In PTIME for two types of queries:
    • The query contains a single table that is a target in a probabilistic mapping
    • If a join attribute is in a table that is a target in a probabilistic mapping, the query returns the attribute
More Theoretical Results
• Query answering in both semantics is in PTIME in the size of the probabilistic mapping
• Compressed representations of probabilistic mappings:
  • We propose two compact representations of probabilistic mappings, such that query answering is still in PTIME in the size of the mapping
  • When we encode probabilistic mappings using a Bayes Net, query answering can be exponential in the size of the mapping
Conclusions
Goal: Provide quality search, querying and browsing for dataspaces
Thesis Contributions
• An algorithm for reference reconciliation
• An indexing method for supporting queries that combine keywords and structure
• An algorithm for answering structured queries on unstructured data
• The concept and theoretical foundation for Probabilistic Schema Mapping
• An approach for visualizing heterogeneous data
• A PIM system incorporating the above
Future Work I. Evolve Semantic Relationships Between Data Sources on an As-needed Basis
[Figure: a query Q over a Mediated Schema on top of data sources D1 to D5, shown next to the same sources D1 to D5 without a mediated schema]
Future Work II. Manage Dataspaces at the Web-Scale
Challenges: Large scale and complex domains
Future directions:
1. Probabilistic data integration
2. Information redundancy
3. Universal search
[Figure label: Keyword Search]
Research Methodology
[Figure: research areas Machine Learning, Information Retrieval, Database]
System
1. Semex Personal Information Management System [Sigmod’05 Best Demo]
2. Woogle Web Service Search Engine [VLDB’04]
Theory
1. Probabilistic Schema Mapping [VLDB’07]
2. XML Query Containment [VLDB’04]
3. Optimization of Query Difference (Submitted)
Acknowledgement
[Figure: association graph around Project: Semex, with nodes Person: Luna, Person: Alon, Person: Jayant, Person: Michelle, Person: Yuhan, Stanford Visual Grp and CIDR, and edges labeled projectLeader, participant, advisor, co-worker, collaborator, ArticleAbout and publishedIn]
Our Algorithm Equals or Outperforms Attr-wise Matching in All Classes
Class      Attr-wise Matching        Association Network
           Precision    Recall       Precision    Recall
Person     0.872        0.733        0.924        0.875
Article    0.997        0.977        0.999        0.976
Venue      0.935        0.790        0.987        0.937
Results on the Cora Dataset Are Competitive with Other Reported Results
Results reported in other record linkage papers:
• Precision/Recall = 0.990/0.925 [Cohen et al., 2002]
• Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]
• F-measure = 0.867 [Bilenko and Mooney, 2003]
Class      Attr-wise Matching        Dependency Graph
           Prec/Recall    F-msre     Prec/Recall    F-msre
Article    0.985/0.913    0.948      0.985/0.924    0.954
Person     0.994/0.985    0.989      1/0.987        0.993
Venue      0.982/0.362    0.529      0.837/0.714    0.771
Experiment Settings
Measure: Diversity and Dispersion
• Diversity: for every result partition, how many real-world objects are included; ideally should be 1 (related to precision)
• Dispersion: for every real-world object, how many result partitions include it; ideally should be 1 (related to recall)
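The two measures can be computed as in the following Python sketch. It assumes that the reported number is the average over all result partitions (diversity) and over all real-world objects (dispersion); the slide does not say how the per-partition counts are aggregated, so the averaging, the function names and the toy data are assumptions.

```python
def diversity(result, truth):
    """Average number of distinct real-world objects per result partition."""
    return sum(len({truth[ref] for ref in part}) for part in result) / len(result)


def dispersion(result, truth):
    """Average number of result partitions over which a real-world object is spread."""
    partitions_of = {}
    for i, part in enumerate(result):
        for ref in part:
            partitions_of.setdefault(truth[ref], set()).add(i)
    return sum(len(parts) for parts in partitions_of.values()) / len(partitions_of)


# Toy gold standard: which real-world person each reference denotes.
truth = {"p1": "Epstein", "p4": "Epstein", "p2": "Stonebraker",
         "p5": "Stonebraker", "p3": "Wong", "p6": "Wong"}
# Toy result: p3 is wrongly merged with Stonebraker's references, p6 is left alone.
result = [{"p1", "p4"}, {"p2", "p5", "p3"}, {"p6"}]

print(diversity(result, truth))
print(dispersion(result, truth))
```

In the toy result one partition mixes two persons and one person is split over two partitions, so both measures come out at 4/3 instead of the ideal value 1.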
Diversity and Dispersion Are Very Close to 1
Dataset (#per/#ref)    Attr-wise Matching        Dependency Graph
                       Diversity/Dispersion      Diversity/Dispersion
A (1750/24076)         1.18/1.003                1.047/1.003
B (1989/36359)         1.067/1.01                1.039/1.008
C (1570/15160)         1.053/1.003               1.03/1.017
D (1518/17199)         1.041/1.004               1.023/1.005
Avg                    1.085/1.005               1.035/1.008
I. Visualizing Heterogeneous Data
Current data visualization
• Considers only data residing in a single database
• Allows users to specify a visualization for each type of data (e.g., Haystack [Karger et al., 2005])
Visualization of dataspaces needs to consider data from heterogeneous sources
Example Visualization: A Map Marked with Papers
Example Visualization: A Calendar with Presentation Slides