2014-11 apacheconeu : lizard - clustering an rdf triplestore
TRANSCRIPT
![Page 2: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/2.jpg)
Why Cluster?➢ Service Resilience
○ Failures○ Server admin and security patches
➢ Performance / scale○ More hardware : CPU, RAM, system bus
![Page 3: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/3.jpg)
Why Cluster?➢ Alternatives
○ Full clustering○ Master-slave (load scale only ; update issues)○ Application visible partitioning
![Page 4: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/4.jpg)
Who am I?➢ Committer on Apache Jena
○ Deploy/operate Jena/TDB in ££job.
➢ W3C○ Co-editor on SPARQ 1.0 and 1.1 query lang
spaces○ RDF 1.1 (on syntax, inc. SPARQL alignment)○ ASF’s W3C AC representative
![Page 5: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/5.jpg)
Acknowledgements➢ Apache➢ Partial funding : InnovateUK* ➢ Users
○ For the discussion and encouragement
* Used to be the Technology Strategy Board.UK Department for Business, Innovation & Skills
![Page 6: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/6.jpg)
Outline➢ TDB Design➢ SPARQL Execution➢ Lizard Design➢ Back to SPARQL
![Page 7: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/7.jpg)
TDB➢ Custom RDF Database
○ Quads + Triples
➢ Custom index code○ Threaded B+Trees
➢ Dictionary terms○ NodeId – 8 bytes○ Inline terms (numbers, dates, dateTimes)
![Page 8: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/8.jpg)
TDB : Node Table
![Page 9: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/9.jpg)
TDB : Indexes➢ Indexes are covering
○ Range scans○ All key, no value○ No "triple table"
![Page 10: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/10.jpg)
SPARQL Execution{ ?x :p 123 . }
Convert to NodeIds
Look in POS to get all PO?, assign S to ?x
123 is an inline constant in TDB.
{ ?x :p 123 . ?x :q ?v . }
A database joinIndex join (Loop+substitution)
Index join (= loop) on :x1 :q ?vwhere :x1 is the value of ?x
![Page 11: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/11.jpg)
Index Implementation➢ TDB uses threaded B+Trees for indexes
○ 8K blocks 100-way B+Tree○ Threaded : scan only touches data blocks
SPO SPO SPO ------ ------ ------
Ptr Ptr ------ ------ ------
SPO SPO SPO SPO ------ ------
Ptr Ptr Ptr ------ ------
SPO SPO SPO SPO SPO SPO SPO SPO SPO SPO ------ ------
![Page 12: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/12.jpg)
Choices
Where to introduce distribution?
Query and Update
Indexes / B+Trees Node table / Objects
Blocks Key → Value Store
![Page 13: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/13.jpg)
This Does Not Work (very well)
➢ Impedance mismatch○ Too much data moving about○ Little parallelism○ Bad cold-start
Distribute the storageK->V store
Index access on query processor
Query and Update
B+Trees Objects
Blocks Key→Value
![Page 14: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/14.jpg)
Distribute
➢ Distribute the indexes○ With modified index access
➢ Distribute the nodes➢ Comms : Apache Thrift
Query and Update
B+Trees Objects
Blocks Key→Value
![Page 15: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/15.jpg)
Clustered Node Table➢ Node Table
○ N replicas; Read R / Write We.g. W=N and R =1 =>Complete copies of node table on each data server
○ ReplaceableRequirement: NodeId for naming
![Page 16: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/16.jpg)
Clustered Indexes➢ Indexes
○ Shard by subject○ Replica shards○ Compound access operations
![Page 17: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/17.jpg)
Clustered IndexesIndex
Shard 1 Shard 2 Shard 3
Machine 1 Machine 2
![Page 18: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/18.jpg)
Configuration 1
![Page 19: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/19.jpg)
Configuration 2
![Page 20: 2014-11 ApacheConEU : Lizard - Clustering an RDF TripleStore](https://reader035.vdocuments.net/reader035/viewer/2022062514/559f69ed1a28ab021e8b4658/html5/thumbnails/20.jpg)
Modified SPARQL Execution➢ Different unit of index access
○ subject + several predicates(subj, pred1, pred2, pred3, …)
➢ Different join algorithms○ Merge join○ Parallel hash join