sti summit 2011 - db vs rdf

13
DB vs RDF: structure vs correlation +comments on data integration & SW Peter Boncz Senior Research Scientist @ CWI Lecturer @ Vrije Universiteit Amsterdam Architect & Co-founder MonetDB Architect & Co-founder VectorWise 1 STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Upload: semantic-technology-institute-international

Post on 11-Nov-2014

254 views

Category:

Education


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: STI Summit 2011 - DB vs RDF

DB vs RDF: structure vs correlation +comments on data integration & SW

Peter Boncz

Senior Research Scientist @ CWI Lecturer @ Vrije Universiteit Amsterdam

Architect & Co-founder MonetDB Architect & Co-founder VectorWise

1 STI Summit, July 6, 2011 Riga. DB vs

RDF: structure vs correlation

Page 2: STI Summit 2011 - DB vs RDF

2

Correlations

Real data is highly correlated }  (gender, location) ó firstname }  (profession,age) ó income But… database systems assume attribute independence }  wrong assumption leads to suboptimal query plan

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Page 3: STI Summit 2011 - DB vs RDF

3

Example: co-authorship query

DBLP “find authors who published in VLDB, SIGMOD and Bioinformatics”

SELECT author FROM publications p1, p2, p3 WHERE p1.venue = VLDB and

p2.venue = SIGMOD and p3.venue = Bioinformatics and p1.author = p2.author and p1.author = p3.author and p2.author = p3.author

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

DBLP “find authors who published in VLDB, SIGMOD and Bioinformatics”

SELECT author FROM publications p1, p2, p3 WHERE p1.venue = VLDB and

p2.venue = SIGMOD and p3.venue = Bioinformatics and p1.author = p2.author and p1.author = p3.author and p2.author = p3.author

Page 4: STI Summit 2011 - DB vs RDF

4

Correlations: RDBMS vs RDF

Schematic structure in RDF data hidden in correlations, which are everywhere

SPARQL leads to plans with }  many self-joins }  whose hit-ratio is correlated (with eg selections) Relational query optimizers do not have a clue è all self-joins look the same to it è random join order, bad query plans L

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Page 5: STI Summit 2011 - DB vs RDF

5

RDF Engines to the Next Level

Challenge: solving the correlation problem. Ideas? }  Interleave optimization and execution

}  Run-time sampling to detect true selectivities }  Run-time query (re-)planning

}  Tackling when there are long join chains }  Creating partial path indexes }  Graph “cracking”: index build as side-effect of query }  .

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Page 6: STI Summit 2011 - DB vs RDF

6

Stratos Idreos: database cracking

Challenge: solving the correlation problem. Ideas? }  Interleave optimization and execution

}  Run-time sampling to detect true selectivities }  Run-time query (re-)planning

}  Tackling when there are long join chains }  Creating partial path indexes }  Graph “cracking”: index build as side-effect of query }  .

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Page 7: STI Summit 2011 - DB vs RDF

7

RDF Engines to the Next Level

Challenge: solving the correlation problem. Ideas? }  Interleave optimization and execution

}  Run-time sampling to detect true selectivities }  Run-time query (re-)planning

}  Tackling when there are long join chains }  Creating partial path indexes }  Graph “cracking”: index build as side-effect of query }  “recycling” intermediate results

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Page 8: STI Summit 2011 - DB vs RDF

8

RDF Engines to the Next Level

Benchmarking in LOD2: }  Attempting to engage vendors to collaborate

“a TPC for RDF”

}  New and more challenging benchmarks “Suitably designed benchmarks drive progress”

benchmarks with: }  complex query patterns on large data }  geo and text data and queries }  outside Linked Open Datasets that get joined to synthetic data }  highly interlinked graph structures and queries on them }  correlated query predicates

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Page 9: STI Summit 2011 - DB vs RDF

9

Social Intelligence Benchmark (SIB)

RDF-friendly benchmark simulating a huge social network }  Social Graphs have understandable, interesting, scenarios }  Social Graphs are highly connected

}  LOD in the wild not yet + Exploiting knowledge bases from interlinked RDF datasets è not only synthetic data, also linking out to DBpedia

conversation topics, real world concepts, geographical information,

connectivity, social network analysis Data correlations galore

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Page 10: STI Summit 2011 - DB vs RDF

10

thoughts on.. Information Integration

Data integration and Schema Integration }  Different applications, organizations, time motives hard problem, tens of B$/y

}  Has been on the DB R&D menu for 20+ years

}  AI complete, immature tech }  hard to achieve high precision automatically

}  Semantic Web does not solve this issue }  In fact, it is its major hurdle to success }  Information integration != {Reasoning,Inference,Logic}

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Page 11: STI Summit 2011 - DB vs RDF

11

The schema.org Approach

choose }  web-addressable, machine-readable schema+data over }  ragged, graph, schema-last RDF data model Approach: }  schema-first, centralized, controlled }  well-defined use case (web search = ad money)

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Page 12: STI Summit 2011 - DB vs RDF

12

The Watson Approach

Winning Jeopardy! Is pretty cool No central role for reasoning, inference there Recipe: }  Finding statistical evidence in Big Data }  Using semantics for focused sub-tasks (only) }  Intelligently combining multiple approaches }  Focusing on the Jeopardy! problem at hand

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation

Page 13: STI Summit 2011 - DB vs RDF

13

DBMS Architecture for new HW

“hardware-conscious database architectures” }  focus on exploiting modern hardware }  platform independent, flexible }  highest performance per core }  green }  scalability to ~10 TB, currently

examples:

STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation