Transcript
Page 1: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Scalable SPARQL Querying of Large RDF Graphs

Xu Bo

2012.06.11

In PVLDB, 4(21), 2011

Page 2: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

OutlineOutline• About Presenter

• Semantic Web

• Previous Work

• New Problem

• SYSTEM ARCHITECTURE

• EXPERIMENTS

• CONCLUSIONS AND FUTURE WORK

23/4/20 2

Page 3: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

About Presenter

• Daniel Abadi• Associate Professor of Computer Science in Yale University• Research

– Column-Oriented Database Systems– Petascale Parallel Database Systems (HadoopDB) – Semantic Web Data Management

23/4/20 3

Page 4: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Semantic WebSemantic Web

• The vision of Semantic Web is to build a "web of data" that enables machines to understand the semantics of information on the Web

23/4/20 4

Page 5: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Google Knowledge Graph

23/4/20 5

Page 6: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Key Technology

• HTML

• XML

23/4/20 6

Page 7: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

The Disadvantage of XML David Billington is a lecturer of Discrete Mathematics.

• there is no standard way of assigning meaning to tag nesting

23/4/20 7

Page 8: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

The Disadvantage of Xpath

• Suppose we want to collect all academic staff members. A path expression in Xpath might be //academicStaffMember

• XML is semantically unsatisfactory

23/4/20 8

Page 9: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

RDF

• Resource Description Framework• 用 Web标识符(称作统一资源标识符, Uniform

Resource Identifiers 或 URIs)来标识事物,用简单的属性( property)及属性值来描述资源

23/4/20 9

Page 10: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

RDF as Triples and a Graph

23/4/20 10

Page 11: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

SPARQL

• RDF query language

• A basic graph pattern

• Answering SPARQL can be seen as finding subgraphs in the RDF data that match the graph pattern

23/4/20 11

Page 12: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Example for Star Pattern

• Find the names of the strikers that play for FC Barcelona.

23/4/20 12

Page 13: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Another Example• Find football players playing for clubs in apopulous region where they were born.

23/4/20 13

Page 14: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

23/4/20 14

Page 15: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Previous Work• RDF In RDBMSs

• Property Tables

• Vertically Partitioned Approach

23/4/20 15

Page 16: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

RDF In RDBMSs• Get the title of the book(s) Joe Fox wrote

in 2001

23/4/20 16

Page 17: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Property Tables

23/4/20 17

Page 18: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Vertically Partitioned Approach

23/4/20 18

Page 19: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

New Problem• Single node RDF management systems are

abundant– Sesame– Jena– RDF3X– 3store

• Research in clustered RDF management is less significantly explored: The focus of the talk

23/4/20 19

Page 20: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

SYSTEM ARCHITECTURE

23/4/20 20

Page 21: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Graph Partitioning• Hash vs. Graph partitioning

– Hash: Only efficient for star patterns– Graph: Taking advantage of graph model

23/4/20 21

Page 22: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Graph Partitioning• Edge vs. Vertex partitioning

– Edge: Natural but inefficient for query execution

– Vertex: Superior for common graph patterns

23/4/20 22

Page 23: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Vertex Partitioning• Preprocess

– remove triples whose predicate is rdf:type

• METIS partitioner

23/4/20 23

Page 24: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Triple Placement• Minimizing data shuffling/exchange

– Allowing data overlap

• N-hop guarantee– The extent of data overlap– If a vertex is assigned to a machine, any

vertex that is within n-hop of this vertex is also stored in this machine

23/4/20 24

Page 25: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

DIRECTED N-HOP GUARANTEE

23/4/20 25

Page 26: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

A potential problem• triples (s, p, o) and (o, p’, o’)

– 2-hop guarantee

• triples (s, p, o) and (s’, p’, o)– not guaranteed

• “object-connected” is not unusual

• undirected n-hop guarantee

23/4/20 26

Page 27: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Triple Placement Algorithm

23/4/20 27

Page 28: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Query Processing• Queries are executed in RDF-stores and/or

Hadoop• Query execution is more efficient in RDF-

stores than in Hadoop– Pushing as much of the processing as possible

into RDF-stores– Minimizing the number of Hadoop jobs– The larger the hop guarantee, the more work is

done in RDF-stores23/4/20 28

Page 29: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

To Communicate, or not to Communicate

• Given a query and n-hop guarantee, is communication (Hadoop job) between nodes needed?– Choose the “center” of the query graph– Calculate the distance from the “center” to the

furthest edge– If distance > n, communication is needed; not

needed otherwise

23/4/20 29

Page 30: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Determining whether a Query is PWOC

PWOC Query– parallelizable without communication

• DoFE– distance of farthest edge– the vertex in a graph with the smallest DoFE will be

the most central in a graph

23/4/20 30

Page 31: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

The algorithm

23/4/20 31

Page 32: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

the issue of duplicate results• naive approach

– remove duplicates after the query has completed

• owner-computes model– add triples (v, ‘<isOwned>’, ‘Yes’) to a

partition– For each query issued to the RDF-stores, add

an additional pattern (core, ‘<isOwned>’, ‘Yes’)

23/4/20 32

Page 33: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

A query is not PWOC• decompose the query into PWOC

subqueries

• use Hadoop jobs to join the results of the PWOC subqueries

• The number of Hadoop jobs required to complete the query increases as the number of subqueries increases

23/4/20 33

Page 34: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

minimal number of subqueries• reduces to the problem of finding minimal

edge partitioning of a graph into subgraphs of bounded diameter

• brute-force

23/4/20 34

Page 35: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Examlple

23/4/20 35

DoFEs for manager, footballClub, Barcelona and club are 2, 2, 2 and 1

the DoFEs for footballer, pop, region, player and club are 3, 3, 2, 2 and 2,

Page 36: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Decompose Example

23/4/20 36

Page 37: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

EXPERIMENTS• 20-machine cluster

• Leigh University Benchmark (LUBM): 270 million triples

• Competitors:– Single-node RDF-3X– SHARD: triple-store system in Hadoop– Graph partitioning (the proposed system)– Hash partitioning on subjects

23/4/20 37

Page 38: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Data Load Time

23/4/20 38

Page 39: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Performance Comparison

23/4/20 39

Page 40: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Varying Number of Machines

23/4/20 40

Page 41: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Summary

23/4/20 41

Page 42: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

23/4/20 42


Top Related