
Workload-Aware RDF Partitioning and SPARQL Query Caching for Massive RDF Graphs stored in NoSQL Databases

Simpósio Brasileiro de Banco de Dados (SBBD), Uberlândia, October 2017

Luiz Henrique Zambom Santana
Prof. Dr. Ronaldo dos Santos Mello

Agenda
● Introduction: Motivation, objectives, and contributions
● Background
  ○ RDF
  ○ NoSQL
● State of the Art
● Rendezvous
  ○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
  ○ Querying: Query decomposition and Caching
● Evaluation

2

Introduction: Motivation
● Since the Semantic Web proposal in 2001, many advances have been introduced by the W3C
● RDF and SPARQL are currently widespread:
  ○ Best Buy: http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html
  ○ Globo.com: https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013
  ○ US data.gov: https://www.data.gov/developers/semantic-web

3

Introduction: Motivation (LOD stats)

4

Introduction: Motivation
● Research problem
  ○ Storing/querying large RDF graphs
    ■ No single node can handle the complete graph
    ■ Native RDF storage cannot scale to the current data requirements
    ■ Inter-partition joins are very costly
● Research hypothesis
  ○ A workload-aware approach based on distributed polyglot NoSQL persistence could be a good solution

5

Rendezvous
● Triplestore implemented as middleware for storing massive RDF graphs into multiple NoSQL databases
● Novel data partitioning approach
● Fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
● Caching structure that accelerates the query response

Introduction: Contributions
● Mapping of RDF to columnar, document, and key/value NoSQL models;
● A workload-aware partitioner based on the current graph structure and, mainly, on the typical application workload;
● A caching schema based on key/value databases for speeding up the query response time;
● An experimental evaluation that compares the current version of our approach against two baselines, ScalaRDF (Hu et al., 2016) and Rainbow, by considering Redis, Apache Cassandra, and MongoDB, the most popular key/value, columnar, and document NoSQL databases, respectively.

7

Agenda
● Introduction: Motivation, objectives, and contributions
● Background
  ○ RDF
  ○ NoSQL
● State of the Art
● Rendezvous
  ○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
  ○ Querying: Query decomposition and Caching
● Evaluation

8

Background: RDF and SPARQL

9

Background: NoSQL
● No SQL interface
● No ACID transactions
● Very scalable
● Schemaless

https://db-engines.com/en/ranking

10

Agenda
● Introduction: Motivation, objectives, and contributions
● Background
  ○ RDF
  ○ NoSQL
● State of the Art
● Rendezvous
  ○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
  ○ Querying: Query decomposition and Caching
● Evaluation
● Schedule

11

State of the Art - Triplestores

Triplestore        | Frag. | Replication                                          | Partitioning | Model                | In-memory           | Workload-aware
Hexastore (2008)   | No    | No                                                    | No           | Native               | No                  | No
SW-Store (2009)    | No    | No                                                    | Vertical     | SQL                  | No                  | No
CumulusRDF (2011)  | No    | No                                                    | Vertical     | Columnar (Cassandra) | No                  | No
SPOVC (2012)       | No    | No                                                    | Horizontal   | Columnar (MonetDB)   | No                  | No
WARP (2013)        | Yes   | N-hop replication on partition boundary               | Hash         | Native               | No                  | Dynamic
Rainbow (2015)     | No    | No                                                    | Hash         | Polyglot             | K/V cache           | Static
ScalaRDF (2016)    | No    | Next-hop                                              | Hash         | Polyglot             | K/V cache           | No
Rendezvous         | Yes   | N-hop replication on fragment and partition boundary  | V and H      | Polyglot             | K/V and local cache | Dynamic

(The last row highlights the key differentials of Rendezvous.)

Agenda
● Introduction: Motivation, objectives, and contributions
● Background
  ○ RDF
  ○ NoSQL
● State of the Art
● Rendezvous
  ○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
  ○ Querying: Query decomposition and Caching
● Evaluation
● Schedule

13

Rendezvous
● Triplestore implemented as middleware for storing massive RDF graphs into multiple NoSQL databases
● Novel data partitioning approach
● Fragmentation strategy that maps pieces of the RDF graph into NoSQL databases with different data models
● Caching structure that accelerates the query response

14

Rendezvous: Architecture

15

Workload awareness and Middleware core (architecture components)

16

Workload awareness

Given the graph [example RDF graph with nodes A, B, C, D, F, G, H, I, J, L, M connected by predicates p1-p11], if the following queries are issued:

SELECT ?x WHERE { B p2 C . C p3 ?x }
SELECT ?x WHERE { F p6 G . F p9 L . F p8 ?x }

the Dataset Characterizer records (a code sketch follows below):

Star-shaped (indexed by the subject/object):  F  → {F p6 G, F p9 L, F p8 ?}
Chain-shaped (indexed by the predicate):      p3 → {B p2 C, C p3 ?}

17
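A minimal sketch (not the authors' code) of how the Dataset Characterizer state above could be held in memory, assuming star entries are keyed by the shared subject/object and chain entries by the predicate; class and method names are hypothetical.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory view of the Dataset Characterizer state shown above.
public class DatasetCharacterizer {
    // star-shaped patterns keyed by the shared subject/object, e.g. F -> {F p6 G, F p9 L, F p8 ?}
    private final Map<String, List<String>> starBySubject = new HashMap<>();
    // chain-shaped patterns keyed by the predicate, e.g. p3 -> {B p2 C, C p3 ?}
    private final Map<String, List<String>> chainByPredicate = new HashMap<>();

    public void recordStar(String subject, String pattern) {
        starBySubject.computeIfAbsent(subject, k -> new ArrayList<>()).add(pattern);
    }

    public void recordChain(String predicate, String pattern) {
        chainByPredicate.computeIfAbsent(predicate, k -> new ArrayList<>()).add(pattern);
    }

    public boolean tendsToStar(String subject)    { return starBySubject.containsKey(subject); }
    public boolean tendsToChain(String predicate) { return chainByPredicate.containsKey(predicate); }

    public static void main(String[] args) {
        DatasetCharacterizer dc = new DatasetCharacterizer();
        dc.recordStar("F", "F p6 G . F p9 L . F p8 ?");   // from the star-shaped query
        dc.recordChain("p3", "B p2 C . C p3 ?");          // from the chain-shaped query
        System.out.println(dc.tendsToStar("F") + " " + dc.tendsToChain("p3")); // true true
    }
}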

Rendezvous: Storing
● Fragmentation and Mapping
● Indexing
● Partitioning

18

Star Fragmentation (n-hop expansion)

Given the graph, the incoming triple F p10 C, and this Dataset Characterizer state:

Star-shaped:  F  → {F p6 G, F p9 L, F p8 ?}
Chain-shaped: p3 → {B p2 C, C p3 ?}

F tends to be in star queries with diameter 1, so we expand the triple F p10 C to a 1-hop fragment around F [subgraph with B, C, G, H, I, L reached via p5-p10]. The fragment for F p10 C will be stored (a code sketch follows below).

19
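A minimal sketch of the 1-hop expansion step, assuming the neighborhood can be collected from a list of triples; in Rendezvous the lookup would go through the indexes and NoSQL stores rather than an in-memory list, and the edge directions of the toy graph are assumptions for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StarExpansion {
    // Collects every triple within 1 hop of 'center' (as subject or object),
    // which forms the fragment stored for an incoming triple like F p10 C.
    static List<String[]> expandStar(String center, List<String[]> graph) {
        List<String[]> fragment = new ArrayList<>();
        for (String[] t : graph) {
            if (t[0].equals(center) || t[2].equals(center)) fragment.add(t);
        }
        return fragment;
    }

    public static void main(String[] args) {
        List<String[]> graph = Arrays.asList(
            new String[]{"F", "p6", "G"}, new String[]{"F", "p7", "I"},
            new String[]{"F", "p8", "H"}, new String[]{"F", "p9", "L"},
            new String[]{"B", "p5", "F"}, new String[]{"F", "p10", "C"},
            new String[]{"A", "p1", "B"});
        for (String[] t : expandStar("F", graph))
            System.out.println(String.join(" ", t));   // all triples touching F
    }
}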

Star Fragmentation (mapping)

With the expanded fragment [the 1-hop subgraph around F], we translate it to a JSON document stored in the document database:

{
  subject: F,
  p6: G,
  p7: I,
  p8: H,
  p10: C,
  p9: L,
  p5: { object: B }
}

20
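A minimal sketch of the mapping above, using plain Java maps to stand in for the JSON document that would be written to the document database (MongoDB in Rendezvous); the nested entry for the incoming p5 edge mirrors the slide.

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class StarFragmentMapper {
    // Builds the document shown on the slide: outgoing edges become
    // predicate -> object fields, and an incoming edge keeps its other
    // endpoint in a nested object (as the p5 entry above does).
    static Map<String, Object> toDocument(String subject,
                                          Map<String, String> outgoing,
                                          Map<String, String> incoming) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("subject", subject);
        outgoing.forEach(doc::put);
        incoming.forEach((pred, node) ->
                doc.put(pred, Collections.singletonMap("object", node)));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("p6", "G"); out.put("p7", "I"); out.put("p8", "H");
        out.put("p10", "C"); out.put("p9", "L");
        Map<String, String> in = Collections.singletonMap("p5", "B");
        System.out.println(toDocument("F", out, in));
        // {subject=F, p6=G, p7=I, p8=H, p10=C, p9=L, p5={object=B}}
    }
}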

Chain Fragmentation (n-hop expansion)

Given the graph, the incoming triple C p3 G, and this Dataset Characterizer state:

Star-shaped:  F  → {F p6 G, F p9 L, F p8 ?}
Chain-shaped: p3 → {B p2 C, C p3 ?}

p3 tends to be in chain queries with max-diameter 1, so we expand the triple C p3 G to a 1-hop fragment [subgraph with B, C, D, F, G, J reached via p2, p3, p6, p11]. The fragment for C p3 G will be stored.

21

Chain Fragmentation (mapping)

With the expanded fragment, we translate it to a set of per-predicate columnar tables stored in the columnar database:

Table p2 (Subj | Obj):  B | C
Table p3 (Subj | Obj):  C | D
                        C | G
Table p6 (Subj | Obj):  F | G

22
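A minimal sketch of the per-predicate tables above, using nested collections to stand in for the column-family tables that Rendezvous would create in the columnar database (Cassandra); the helper name is hypothetical.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChainFragmentMapper {
    // One "table" per predicate, each row holding (subject, object),
    // mirroring the p2, p3, and p6 tables on the slide.
    static Map<String, List<String[]>> toColumnarTables(List<String[]> fragment) {
        Map<String, List<String[]>> tables = new LinkedHashMap<>();
        for (String[] t : fragment) {                          // t = {subject, predicate, object}
            tables.computeIfAbsent(t[1], k -> new ArrayList<>())
                  .add(new String[]{t[0], t[2]});
        }
        return tables;
    }

    public static void main(String[] args) {
        List<String[]> fragment = Arrays.asList(
            new String[]{"B", "p2", "C"}, new String[]{"C", "p3", "D"},
            new String[]{"C", "p3", "G"}, new String[]{"F", "p6", "G"});
        toColumnarTables(fragment).forEach((pred, rows) -> {
            System.out.println("table " + pred + ":");
            rows.forEach(r -> System.out.println("  subj=" + r[0] + " obj=" + r[1]));
        });
    }
}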

Rendezvous: Storing
● Fragmentation and Mapping
● Indexing
● Partitioning

23

Indexing

The Indexer indexes each new triple by the subject and by the object:

S_PO:  F → {p10 C}
       C → {p3 G}

O_SP:  C → {F p10}
       G → {C p3}

This helps with triple expansion and with solving simple queries like:

SELECT ?x WHERE { F p10 ?x }

24
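A minimal sketch of the two indexes, keyed as on the slide; the simple query SELECT ?x WHERE { F p10 ?x } becomes a lookup in S_PO followed by a predicate filter. Class and method names are hypothetical.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TripleIndexer {
    // S_PO: subject -> (predicate, object) pairs; O_SP: object -> (subject, predicate) pairs.
    private final Map<String, List<String[]>> spo = new HashMap<>();
    private final Map<String, List<String[]>> osp = new HashMap<>();

    void index(String s, String p, String o) {
        spo.computeIfAbsent(s, k -> new ArrayList<>()).add(new String[]{p, o});
        osp.computeIfAbsent(o, k -> new ArrayList<>()).add(new String[]{s, p});
    }

    // Solves SELECT ?x WHERE { <subject> <predicate> ?x } from the S_PO index.
    List<String> objectsOf(String subject, String predicate) {
        List<String> result = new ArrayList<>();
        for (String[] po : spo.getOrDefault(subject, Collections.emptyList()))
            if (po[0].equals(predicate)) result.add(po[1]);
        return result;
    }

    public static void main(String[] args) {
        TripleIndexer idx = new TripleIndexer();
        idx.index("F", "p10", "C");
        idx.index("C", "p3", "G");
        System.out.println(idx.objectsOf("F", "p10"));   // [C]
    }
}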

Rendezvous: Storing
● Fragmentation and Mapping
● Indexing
● Partitioning

25

Partitioning

[Example graph split into partitions P1, P2, and P3, stored across columnar and document databases]

If a graph is bigger than a single server's capacity, the Rendezvous DBA can create multiple partitions. Each NoSQL server can hold one or more partitions, and each partition is in only one server.

26

Partitioning

Rendezvous manages the partitions by recording them in the dictionary (a code sketch follows below):

Dictionary (fragments hash):
  (F p10 C), size 2 → {P1, P2}
  (C p3 D),  size 2 → {P3}
  (L p12 H), size 1 → {P2}

P1 elements (S P O):  A p1 B;  F p10 C;  ...
P2 elements (S P O):  F p10 C;  L p12 H;  ...
P3 elements (S P O):  C p3 D;  ...

[Example graph split into partitions P1, P2, and P3, stored across columnar and document databases]

27
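A minimal sketch of the dictionary above, mapping a fragment (identified here by its seed triple) to its size and the partitions that hold it; the concrete representation is an assumption, since the slide only shows the logical content.

import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class PartitionDictionary {
    static class Entry {
        final int size;
        final Set<String> partitions;
        Entry(int size, String... partitions) {
            this.size = size;
            this.partitions = new LinkedHashSet<>(Arrays.asList(partitions));
        }
        public String toString() { return "size " + size + " -> " + partitions; }
    }

    private final Map<String, Entry> byFragment = new LinkedHashMap<>();

    void put(String seedTriple, int size, String... partitions) {
        byFragment.put(seedTriple, new Entry(size, partitions));
    }

    Set<String> partitionsOf(String seedTriple) {
        Entry e = byFragment.get(seedTriple);
        return e == null ? Collections.emptySet() : e.partitions;
    }

    public static void main(String[] args) {
        PartitionDictionary dict = new PartitionDictionary();
        dict.put("F p10 C", 2, "P1", "P2");
        dict.put("C p3 D", 2, "P3");
        dict.put("L p12 H", 1, "P2");
        System.out.println(dict.partitionsOf("F p10 C"));   // [P1, P2]
    }
}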

Partitioning (boundary replication)

[Same dictionary, partition tables, and example graph as the previous slide]

If a triple is on the edge of two partitions, it will be replicated in both partitions. The size of this boundary is defined by the DBA.

28

Partitioning (data placement)

[Same dictionary, partition tables, and example graph as the previous slides]

The fragments hash helps with data placement: based on the triple and the size of its fragment, Rendezvous finds the best partition in which to store the triple (a heuristic sketch follows below).

29
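The slides do not spell out how the "best" partition is chosen, so the sketch below assumes a simple heuristic: prefer a partition that already holds part of the triple's fragment, otherwise the least loaded one. It is an illustration, not the actual placement algorithm.

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

public class DataPlacement {
    // currentLoad: partition -> number of triples it already stores.
    // fragmentPartitions: partitions that already hold pieces of the triple's fragment.
    static String choosePartition(Map<String, Integer> currentLoad,
                                  Set<String> fragmentPartitions) {
        // 1) keep fragments together: pick among the partitions already holding the fragment;
        // 2) otherwise fall back to the globally least loaded partition.
        Collection<String> candidates =
                fragmentPartitions.isEmpty() ? currentLoad.keySet() : fragmentPartitions;
        return candidates.stream()
                .min(Comparator.comparingInt(p -> currentLoad.getOrDefault(p, 0)))
                .orElseThrow(NoSuchElementException::new);
    }

    public static void main(String[] args) {
        Map<String, Integer> load = new LinkedHashMap<>();
        load.put("P1", 120); load.put("P2", 80); load.put("P3", 95);
        // Fragment of (F p10 C) already lives on P1 and P2, so P2 wins (lighter of the two).
        System.out.println(choosePartition(load, new LinkedHashSet<>(Arrays.asList("P1", "P2"))));
        // A brand-new fragment goes to the globally least loaded partition (also P2 here).
        System.out.println(choosePartition(load, Collections.emptySet()));
    }
}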

Rendezvous: Querying
● Query evaluation
● Query decomposition
● Caching

30

Query evaluation

Given the graph [partitioned into P1, P2, and P3], if the following query is issued:

Q: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H .
                     ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }

Rendezvous:
1. Searches for:
   1.1. Simple queries
   1.2. Star queries
   1.3. Chain queries
2. Updates the Dataset Characterizer

Star:  Qs: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H }
Chain: Qc: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }

31

Query decomposition

Given the graph [partitioned into P1, P2, and P3], the sub-queries are:

Star:  Qs:  SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H }
Chain: Q2c: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w }

Rendezvous finds the right partition using the dictionary and translates each SPARQL sub-query into the final query processed by the NoSQL database (a decomposition sketch follows below):

Document database (star sub-query):
D: db.partition2.find({ $and: [
     { p6: { $exists: true }, object: "G" },
     { p7: { $exists: true }, object: "I" },
     { p8: { $exists: true }, object: "H" } ] })

Columnar database (chain sub-query):
Partition 1:
  Cp1: SELECT S1, O1 FROM p1
  Cp2: SELECT S2, O2 FROM p2 WHERE O = S1
Partition 3:
  Cp3: SELECT S3, O3 FROM p3 WHERE O = S2

32
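A minimal sketch of the decomposition idea only: triple patterns that share a subject variable form a star sub-query, and the remaining patterns are treated as a chain. The real decomposer also follows object-to-subject links and then translates each sub-query for the target NoSQL database, which is omitted here.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QueryDecomposer {
    // Each pattern is (subject, predicate, object); variables start with '?'.
    static Map<String, List<List<String>>> decompose(List<List<String>> patterns) {
        Map<String, List<List<String>>> bySubject = new LinkedHashMap<>();
        for (List<String> t : patterns)
            bySubject.computeIfAbsent(t.get(0), k -> new ArrayList<>()).add(t);

        Map<String, List<List<String>>> subQueries = new LinkedHashMap<>();
        List<List<String>> chain = new ArrayList<>();
        for (Map.Entry<String, List<List<String>>> e : bySubject.entrySet()) {
            if (e.getValue().size() >= 2) subQueries.put("star:" + e.getKey(), e.getValue());
            else chain.addAll(e.getValue());
        }
        if (!chain.isEmpty()) subQueries.put("chain", chain);
        return subQueries;
    }

    public static void main(String[] args) {
        List<List<String>> q = Arrays.asList(
            Arrays.asList("?w", "p6", "G"), Arrays.asList("?w", "p7", "I"),
            Arrays.asList("?w", "p8", "H"), Arrays.asList("?x", "p1", "?y"),
            Arrays.asList("?y", "p2", "?z"), Arrays.asList("?z", "p3", "?w"));
        System.out.println(decompose(q));
        // star:?w gets the three p6/p7/p8 patterns; chain gets the p1/p2/p3 patterns
    }
}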

Rendezvous: Querying
● Query evaluation
● Query decomposition
● Caching

33

Caching (two-level cache)

Given the graph, after the last query was issued:

Q: SELECT ?x WHERE { ?w p6 G . ?w p7 I . ?w p8 H .
                     ?x p1 ?y . ?y p2 ?z . ?z p3 ?w .
                     ?y p5 ?w }

Near cache (in-memory tree map):
  A:p1:B → {A:p1:B, B:p2:C}
  B:p2:C → {B:p2:C, C:p3:D}

Remote cache (key/value NoSQL database):
  A:p1:B → {A:p1:B, B:p2:C}
  B:p2:C → {B:p2:C, C:p3:D}
  B:p5:F → {B:p5:F, F:p9:D}
  ...

Normally, the near cache is smaller than the remote cache (a cache sketch follows below).

34
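A minimal sketch of the two-level lookup, assuming triples are keyed as S:p:O strings exactly as in the tables above. A java.util.TreeMap plays the near cache, and a plain HashMap stands in for the remote key/value store (Redis in Rendezvous) so the example stays self-contained; the eviction policy is an assumption.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

public class TwoLevelCache {
    private final NavigableMap<String, Set<String>> nearCache = new TreeMap<>(); // in-memory tree map
    private final Map<String, Set<String>> remoteCache = new HashMap<>();        // stand-in for Redis
    private final int nearCapacity;

    TwoLevelCache(int nearCapacity) { this.nearCapacity = nearCapacity; }

    void putRemote(String key, Set<String> relatedTriples) { remoteCache.put(key, relatedTriples); }

    Set<String> get(String key) {
        Set<String> hit = nearCache.get(key);              // 1) near cache
        if (hit != null) return hit;
        hit = remoteCache.get(key);                        // 2) remote cache
        if (hit != null) {
            if (nearCache.size() >= nearCapacity)          // simple eviction, assumed policy
                nearCache.pollFirstEntry();
            nearCache.put(key, hit);                       // promote to the near cache
        }
        return hit;                                        // null -> go to the NoSQL stores
    }

    public static void main(String[] args) {
        TwoLevelCache cache = new TwoLevelCache(2);
        cache.putRemote("A:p1:B", new HashSet<>(Arrays.asList("A:p1:B", "B:p2:C")));
        cache.putRemote("B:p2:C", new HashSet<>(Arrays.asList("B:p2:C", "C:p3:D")));
        cache.putRemote("B:p5:F", new HashSet<>(Arrays.asList("B:p5:F", "F:p9:D")));
        System.out.println(cache.get("A:p1:B"));   // remote hit, promoted to the near cache
        System.out.println(cache.get("A:p1:B"));   // near-cache hit
    }
}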

Caching (querying)

Given the graph, if the following query is issued:

Q: SELECT ?x WHERE { ?x p1 ?y . ?y p2 ?z . ?z p3 ?w .
                     ?y p5 F }

Near cache (in-memory tree map):
  A:p1:B → {A:p1:B, B:p2:C}
  B:p2:C → {B:p2:C, C:p3:D}

Remote cache (key/value NoSQL database):
  A:p1:B → {A:p1:B, B:p2:C}
  B:p2:C → {B:p2:C, C:p3:D}
  B:p5:F → {B:p5:F, F:p9:D}
  ...

This query will be solved only with triples from the cache.

35
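A minimal sketch of how the cached entries above could resolve the chain ?x p1 ?y . ?y p2 ?z . ?z p3 ?w without touching the NoSQL stores: each cached entry points to related triples, so the walk hops from A:p1:B to B:p2:C to C:p3:D. The key format and walking logic are assumptions based on the slide.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CacheChainWalk {
    // cache: triple key "S:p:O" -> related triples, as in the slide.
    static List<String> walkChain(Map<String, Set<String>> cache, String startKey, String[] predicates) {
        List<String> chain = new ArrayList<>();
        chain.add(startKey);
        String current = startKey;
        for (int i = 1; i < predicates.length; i++) {
            String next = null;
            for (String related : cache.getOrDefault(current, Collections.emptySet())) {
                // the next hop starts at the object of the current triple and uses the next predicate
                if (related.startsWith(current.split(":")[2] + ":" + predicates[i] + ":")) {
                    next = related;
                    break;
                }
            }
            if (next == null) return Collections.emptyList();   // cache miss -> fall back to the stores
            chain.add(next);
            current = next;
        }
        return chain;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> cache = new HashMap<>();
        cache.put("A:p1:B", new HashSet<>(Arrays.asList("A:p1:B", "B:p2:C")));
        cache.put("B:p2:C", new HashSet<>(Arrays.asList("B:p2:C", "C:p3:D")));
        cache.put("B:p5:F", new HashSet<>(Arrays.asList("B:p5:F", "F:p9:D")));
        System.out.println(walkChain(cache, "A:p1:B", new String[]{"p1", "p2", "p3"}));
        // [A:p1:B, B:p2:C, C:p3:D] -- the whole chain resolved from the cache
    }
}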

Agenda
● Introduction: Motivation, objectives, and contributions
● Background
  ○ RDF
  ○ NoSQL
● State of the Art
● Rendezvous
  ○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
  ○ Querying: Query decomposition and Caching
● Evaluation

36

Evaluation
● LUBM: ontology for the university domain, synthetic RDF data scalable to any size, and 14 extensional queries representing a variety of properties
● Generated dataset with 4000 universities (around 100 GB, containing around 500 million triples)
● 12 queries with joins; all of them have at least one subject-subject join, and six of them also have at least one subject-object join
● Apache Jena version 3.2.0 with Java 1.8; Redis 3.2, MongoDB 3.4.3, and Apache Cassandra 3.10
● Amazon m3.xlarge spot instances with 7.5 GB of memory and 1 x 32 GB SSD storage

37

Evaluation: Rendezvous performance

The bigger the number of hops (i.e., the replication), the bigger (exponentially) the dataset size and the loading time. However, since joins are avoided, the response time decreases.

38

Evaluation: Rendezvous different settings

Performance is better when partitioning is managed by Rendezvous.

The bigger the boundary replication, the faster the response time, without a big impact on the dataset size.

39

Evaluation: Rendezvous vs. ScalaRDF

40

Conclusions

● Rendezvous contributes with:
  ○ Addressing the graph partitioning problem via fragments
  ○ Better query response time through n-hop and boundary replication
  ○ Better query response time via two-level caching
  ○ Scalable RDF storage provided by NoSQL databases
● About the evaluation:
  ○ Fragments are scalable
  ○ Bigger boundaries are not necessarily related to bigger storage sizes
  ○ Graph-aware partitions are better than NoSQL partitions
  ○ The near cache is fast, but it makes it more difficult to keep data consistent

41

Future Work

● Formalize the query mapping
  ○ No standard query language to rely on
● Compression of triples during storage
● Update and delete operations
● Other NoSQL types (e.g., graph)
● Better datasets

42

Thank you! (Obrigado!)

Simpósio Brasileiro de Banco de Dados (SBBD), Uberlândia, October 2017

Luiz Henrique Zambom Santana
Prof. Dr. Ronaldo dos Santos Mello

LUBM model

44

Storing: Fragmentation

45

Storing: Fragmentation

46

Storing: Fragmentation

47

Storing: Partitioning

48

Evaluation: Rendezvous vs. Rainbow

49

State of the Art - SQL Triplestores

WARP, Hexastore, YARS, 4store, SPIDER, RDF-3x, SHARD, SW-Store, SOLID, SPOVC, S2X

50

State of the Art - NoSQL Triplestores

RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF, Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF, H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON, Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk, Bouhali and Laurent, Papailiou et al., and ScalaRDF.

51

State of the Art - Triplestores

Recent survey (September 2017):

Ibrahim Abdelaziz, Razen Harbi, Zuhair Khayyat, Panos Kalnis: A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data. Proceedings of the VLDB Endowment, Vol. 10, No. 13, September 2017, pp. 2049-2060.

52