Download - Scalable SPARQL Querying of Large RDF Graphs
![Page 1: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/1.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Scalable SPARQL Querying of Large RDF Graphs
Xu Bo
2012.06.11
In PVLDB, 4(21), 2011
![Page 2: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/2.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
OutlineOutline• About Presenter
• Semantic Web
• Previous Work
• New Problem
• SYSTEM ARCHITECTURE
• EXPERIMENTS
• CONCLUSIONS AND FUTURE WORK
23/4/20 2
![Page 3: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/3.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
About Presenter
• Daniel Abadi• Associate Professor of Computer Science in Yale University• Research
– Column-Oriented Database Systems– Petascale Parallel Database Systems (HadoopDB) – Semantic Web Data Management
23/4/20 3
![Page 4: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/4.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Semantic WebSemantic Web
• The vision of Semantic Web is to build a "web of data" that enables machines to understand the semantics of information on the Web
23/4/20 4
![Page 5: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/5.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Google Knowledge Graph
23/4/20 5
![Page 6: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/6.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Key Technology
• HTML
• XML
23/4/20 6
![Page 7: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/7.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
The Disadvantage of XML David Billington is a lecturer of Discrete Mathematics.
• there is no standard way of assigning meaning to tag nesting
23/4/20 7
![Page 8: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/8.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
The Disadvantage of Xpath
• Suppose we want to collect all academic staff members. A path expression in Xpath might be //academicStaffMember
• XML is semantically unsatisfactory
23/4/20 8
![Page 9: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/9.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
RDF
• Resource Description Framework• 用 Web标识符(称作统一资源标识符, Uniform
Resource Identifiers 或 URIs)来标识事物,用简单的属性( property)及属性值来描述资源
23/4/20 9
![Page 10: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/10.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
RDF as Triples and a Graph
23/4/20 10
![Page 11: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/11.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
SPARQL
• RDF query language
• A basic graph pattern
• Answering SPARQL can be seen as finding subgraphs in the RDF data that match the graph pattern
23/4/20 11
![Page 12: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/12.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Example for Star Pattern
• Find the names of the strikers that play for FC Barcelona.
23/4/20 12
![Page 13: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/13.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Another Example• Find football players playing for clubs in apopulous region where they were born.
23/4/20 13
![Page 14: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/14.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
23/4/20 14
![Page 15: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/15.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Previous Work• RDF In RDBMSs
• Property Tables
• Vertically Partitioned Approach
23/4/20 15
![Page 16: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/16.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
RDF In RDBMSs• Get the title of the book(s) Joe Fox wrote
in 2001
23/4/20 16
![Page 17: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/17.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Property Tables
23/4/20 17
![Page 18: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/18.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Vertically Partitioned Approach
23/4/20 18
![Page 19: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/19.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
New Problem• Single node RDF management systems are
abundant– Sesame– Jena– RDF3X– 3store
• Research in clustered RDF management is less significantly explored: The focus of the talk
23/4/20 19
![Page 20: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/20.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
SYSTEM ARCHITECTURE
23/4/20 20
![Page 21: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/21.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Graph Partitioning• Hash vs. Graph partitioning
– Hash: Only efficient for star patterns– Graph: Taking advantage of graph model
23/4/20 21
![Page 22: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/22.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Graph Partitioning• Edge vs. Vertex partitioning
– Edge: Natural but inefficient for query execution
– Vertex: Superior for common graph patterns
23/4/20 22
![Page 23: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/23.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Vertex Partitioning• Preprocess
– remove triples whose predicate is rdf:type
• METIS partitioner
23/4/20 23
![Page 24: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/24.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Triple Placement• Minimizing data shuffling/exchange
– Allowing data overlap
• N-hop guarantee– The extent of data overlap– If a vertex is assigned to a machine, any
vertex that is within n-hop of this vertex is also stored in this machine
23/4/20 24
![Page 25: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/25.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
DIRECTED N-HOP GUARANTEE
23/4/20 25
![Page 26: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/26.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
A potential problem• triples (s, p, o) and (o, p’, o’)
– 2-hop guarantee
• triples (s, p, o) and (s’, p’, o)– not guaranteed
• “object-connected” is not unusual
• undirected n-hop guarantee
23/4/20 26
![Page 27: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/27.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Triple Placement Algorithm
23/4/20 27
![Page 28: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/28.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Query Processing• Queries are executed in RDF-stores and/or
Hadoop• Query execution is more efficient in RDF-
stores than in Hadoop– Pushing as much of the processing as possible
into RDF-stores– Minimizing the number of Hadoop jobs– The larger the hop guarantee, the more work is
done in RDF-stores23/4/20 28
![Page 29: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/29.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
To Communicate, or not to Communicate
• Given a query and n-hop guarantee, is communication (Hadoop job) between nodes needed?– Choose the “center” of the query graph– Calculate the distance from the “center” to the
furthest edge– If distance > n, communication is needed; not
needed otherwise
23/4/20 29
![Page 30: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/30.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Determining whether a Query is PWOC
PWOC Query– parallelizable without communication
• DoFE– distance of farthest edge– the vertex in a graph with the smallest DoFE will be
the most central in a graph
23/4/20 30
![Page 31: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/31.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
The algorithm
23/4/20 31
![Page 32: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/32.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
the issue of duplicate results• naive approach
– remove duplicates after the query has completed
• owner-computes model– add triples (v, ‘<isOwned>’, ‘Yes’) to a
partition– For each query issued to the RDF-stores, add
an additional pattern (core, ‘<isOwned>’, ‘Yes’)
23/4/20 32
![Page 33: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/33.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
A query is not PWOC• decompose the query into PWOC
subqueries
• use Hadoop jobs to join the results of the PWOC subqueries
• The number of Hadoop jobs required to complete the query increases as the number of subqueries increases
23/4/20 33
![Page 34: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/34.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
minimal number of subqueries• reduces to the problem of finding minimal
edge partitioning of a graph into subgraphs of bounded diameter
• brute-force
23/4/20 34
![Page 35: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/35.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Examlple
23/4/20 35
DoFEs for manager, footballClub, Barcelona and club are 2, 2, 2 and 1
the DoFEs for footballer, pop, region, player and club are 3, 3, 2, 2 and 2,
![Page 36: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/36.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Decompose Example
23/4/20 36
![Page 37: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/37.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
EXPERIMENTS• 20-machine cluster
• Leigh University Benchmark (LUBM): 270 million triples
• Competitors:– Single-node RDF-3X– SHARD: triple-store system in Hadoop– Graph partitioning (the proposed system)– Hash partitioning on subjects
23/4/20 37
![Page 38: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/38.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Data Load Time
23/4/20 38
![Page 39: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/39.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Performance Comparison
23/4/20 39
![Page 40: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/40.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Varying Number of Machines
23/4/20 40
![Page 41: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/41.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Summary
23/4/20 41
![Page 42: Scalable SPARQL Querying of Large RDF Graphs](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f6a550346895daa3ef0/html5/thumbnails/42.jpg)
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
23/4/20 42