
National Institute of Advanced Industrial Science and Technology

Query Processing for Distributed RDF Databases Using a Three-dimensional Hash Index

Akiyoshi MATONO

[email protected]@aist.go.jp

Grid Technology Research Center, AIST


Agenda

Motivation & Aims

Background
• Distributed Hash Table (DHT)

Our approach

Performance evaluation

Summary


Motivation

It is essential to describe resources using RDF to support semantic tasks (e.g., resource discovery).

Today, RDF data is widely used in many fields (e.g., bioinformatics and grid).

Thus, RDF data is scattered everywhere and the total data size is rapidly increasing.

We propose P2P-based RDF query processing.

Providing efficient and scalable RDF query processing in a distributed environment is an important issue.


Aims

RDF data is scattered everywhere.
• Provide an efficient join operation in a distributed environment.

The amount of data is rapidly increasing.
• Reduce the amount of data transferred among nodes.
• Achieve scalability, availability, and reliability.


Distributed Hash Table (DHT)

A structured P2P network.

Achieves scalability, availability, and reliability.

Supports only exact-match lookups.

Lookups for key-value pairs.
• put (key, value), get (key)

Routing is performed in O (log n).

Some protocols.
• Chord, Tapestry, Pastry, CAN, Kademlia
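
As a minimal sketch of the put/get interface above, the following toy model keeps all nodes in one process and assigns each key to the first node whose ID is not smaller than the hash of the key (wrapping around the ring). The class name `ToyDHT`, the 6-bit identifier space, and the use of SHA-1 are assumptions made for illustration; real DHT protocols add routing, replication, and node churn handling.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 6  # assumed size of the identifier space: IDs in [0, 64)


def hash_id(key: str) -> int:
    """Map a string key to an ID in the identifier space."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** ID_BITS)


class ToyDHT:
    """Single-process stand-in for a DHT: no routing, no replication."""

    def __init__(self, node_ids):
        self.node_ids = sorted(node_ids)
        self.storage = {n: {} for n in self.node_ids}

    def responsible_node(self, key: str) -> int:
        # First node whose ID is >= hash(key), wrapping to the smallest ID.
        i = bisect_left(self.node_ids, hash_id(key))
        return self.node_ids[i % len(self.node_ids)]

    def put(self, key: str, value) -> None:
        node = self.responsible_node(key)
        self.storage[node].setdefault(key, []).append(value)

    def get(self, key: str):
        node = self.responsible_node(key)
        return self.storage[node].get(key, [])


dht = ToyDHT([2, 6, 11, 17, 27, 33, 42, 48, 50, 56, 63])
dht.put("alice", "A")
print(dht.get("alice"))  # ['A']
print(dht.get("alic"))   # [] because only exact-match keys are found
```

The second lookup returns nothing even though the key is a prefix of a stored key, which is the exact-match limitation noted on this slide.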


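Chord [Stoica01]

Figure: a Chord ring with the nodes N2, N6, N11, N17, N27, N33, N42, N48, N50, N56, and N63. Each node stores the data whose keys lie in the area of the ring that ends at its own ID (e.g., the area ending at 27 is stored on Node 27).

Example: put ( 28, A ) on N42

1. N42 consults its finger table. The entry that covers key 28 (keys 10-41) points to N11.

   distance   succ.   keys
   N42+0      N42     42
   N42+1      N48     43
   N42+2      N48     44-45
   N42+4      N48     46-49
   N42+8      N50     50-57
   N42+16     N63     58-9
   N42+32     N11     10-41

2. N11 consults its finger table. The entry that covers key 28 (keys 27-42) points to N27.

   distance   succ.   keys
   N11+0      N11     11
   N11+1      N17     12
   N11+2      N17     13-14
   N11+4      N17     15-18
   N11+8      N27     19-26
   N11+16     N27     27-42
   N11+32     N48     43-10

3. N27 consults its finger table. The entry for key 28 points to N33. N33 is the target node.

   distance   succ.   keys
   N27+0      N27     27
   N27+1      N33     28
   N27+2      N33     29-30
   N27+4      N33     31-34
   N27+8      N42     35-42
   N27+16     N48     43-58
   N27+32     N63     59-26

The distance to the nodes in a finger table increases exponentially.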
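
The lookup walk in this example can be reproduced with a small simulation. This is a sketch under assumptions: the 6-bit identifier space is inferred from the wrap-around in the finger tables (N42+32 covers keys starting at 10), the node list is taken from the figure, and the function names are hypothetical. It follows the usual Chord rule of forwarding to the farthest finger that still precedes the key.

```python
M = 6                      # identifier bits (assumed), so IDs live in [0, 64)
RING = 2 ** M
NODES = [2, 6, 11, 17, 27, 33, 42, 48, 50, 56, 63]  # node IDs from the figure


def successor(ident):
    """First node whose ID is >= ident, wrapping around the ring."""
    ident %= RING
    for n in NODES:
        if n >= ident:
            return n
    return NODES[0]


def finger_table(node):
    """i-th finger = successor(node + 2**i) for i = 0..M-1."""
    return [successor(node + 2 ** i) for i in range(M)]


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def lookup(start, key):
    """Return the list of nodes visited while resolving key from start."""
    path = [start]
    node = start
    while True:
        succ = finger_table(node)[0]
        if in_half_open(key, node, succ):
            path.append(succ)  # succ is responsible for key
            return path
        # Forward to the farthest finger that still precedes the key.
        nxt = succ
        for f in reversed(finger_table(node)):
            if f != key and in_half_open(f, node, key):
                nxt = f
                break
        node = nxt
        path.append(node)


print(finger_table(42))  # [48, 48, 48, 50, 63, 11]
print(lookup(42, 28))    # [42, 11, 27, 33]
```

The printed path matches the three steps above: N42 forwards to N11, N11 to N27, and N27 identifies N33 as the target node.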


Our Approach

Three-dimensional hash space called “RDFCube”
• Each axis represents the hash space for one of subject, predicate, and object.
• Consists of a set of cubes of the same size called “cells”.

Bit information of RDFCube called “existence flag”
• Each cell contains a bit that indicates the presence or absence of triples mapped into the cell.

Runs on top of two DHTs.
• RDFPeers DHT is used to store triples.
• RDFCube DHT is used to store bit information.


RDFCube: three-dimensional hash space

Each axis represents the hash space for one of the triple’s elements (subject, predicate, and object).

RDFCube is composed of a set of cubes of the same size called “cells”.

A triple is mapped into RDFCube based on the hash values of its elements.

Figure: the elements s, p, and o of a triple are hashed to the values 13, 54, and 39 on the subject, predicate, and object axes. This triple is mapped into the point (13, 54, 39). The point is contained in the cell [0, 3, 2].
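
A minimal sketch of this mapping follows. The slide does not state the size of the hash space or the number of divisions; the values below (hash values 0..63, 4 divisions per axis) are assumptions chosen because they reproduce the example point (13, 54, 39) falling into cell [0, 3, 2]. The SHA-1 based `hash_value` is likewise only a stand-in for whatever hash function is actually used.

```python
import hashlib

HASH_SPACE = 64  # assumed: each axis covers hash values 0..63
DIVISIONS = 4    # assumed: each axis is split into 4 equal parts


def hash_value(term: str) -> int:
    """Stand-in hash: map a term to a value on one axis."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % HASH_SPACE


def cell_of(point):
    """Return the cell (per-axis division index) that contains a point."""
    width = HASH_SPACE // DIVISIONS
    return tuple(v // width for v in point)


def map_triple(s: str, p: str, o: str):
    """Hash each element of a triple and return (point, cell)."""
    point = (hash_value(s), hash_value(p), hash_value(o))
    return point, cell_of(point)


print(cell_of((13, 54, 39)))  # (0, 3, 2)
```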


Existence Flag

Each cell contains a bit that indicates the presence or absence of triples mapped into the cell.

Figure: units of bit information along the subject, predicate, and object axes.
• Cell [0, 3, 2]: existence flag (a single bit, 1 in the figure)
• Cell sequence [0, 1, *]: bit sequence 0 1 1 0
• Cell matrix [0, *, *]: bit matrix

   0 0 0 0
   0 1 1 0
   0 0 0 0
   0 1 0 0
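
The three granularities of bit information (flag, sequence, matrix) can be sketched as views over a set of occupied cells. Everything here is illustrative: the set `occupied` is made-up data chosen to reproduce the bit matrix shown for [0, *, *], and the convention that rows and columns follow the two remaining axes in (subject, predicate, object) order is an assumption, not something the slide specifies.

```python
DIVISIONS = 4  # assumed number of divisions per axis


def bit_matrix(occupied, axis, index):
    """Bit matrix of the cell matrix that fixes `axis` to `index`.

    Rows and columns follow the two remaining axes in (s, p, o) order.
    """
    rest = [a for a in range(3) if a != axis]
    matrix = [[0] * DIVISIONS for _ in range(DIVISIONS)]
    for cell in occupied:
        if cell[axis] == index:
            matrix[cell[rest[0]]][cell[rest[1]]] = 1
    return matrix


def bit_sequence(occupied, pattern):
    """Bit sequence of a cell sequence such as (0, 1, None).

    None marks the single free axis (the '*' in the slide notation).
    """
    free = pattern.index(None)
    seq = []
    for i in range(DIVISIONS):
        cell = tuple(i if a == free else pattern[a] for a in range(3))
        seq.append(1 if cell in occupied else 0)
    return seq


# Made-up set of cells that contain at least one triple.
occupied = {(0, 1, 1), (0, 1, 2), (0, 3, 1)}

for row in bit_matrix(occupied, axis=0, index=0):
    print(row)
print(bit_sequence(occupied, (0, 1, None)))  # [0, 1, 1, 0]
```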


Two DHTs: RDFCube & RDFPeers

RDFPeers DHT is used to store RDF triples.
• RDFPeers is an RDF repository utilizing a DHT.
• Proposed by [Min Cai and Martin Frank, 2004]

RDFCube DHT is used to store bit information.
• Used as an index for RDFPeers.

Storing triples.
1. Store the triples in RDFPeers DHT.
2. Store the bit information of the triples in RDFCube DHT.

Query processing with join operation.
1. Get the bit information from RDFCube DHT.
2. Perform AND operations on the bits.
3. Get triples from RDFPeers DHT based on the bit information.


RDFPeers [Cai04]

An RDF repository utilizing a DHT.

We call the DHT for RDFPeers the RDFPeers DHT.
• Key: each of subject, predicate, and object
• Value: triple

To store a triple into RDFPeers DHT:
• The triple is stored 3 times on 3 nodes by 3 lookups that use the triple’s elements as keys.

Figure: the triple (s, p, o) is stored on the RDFPeers DHT (nodes N4, N8, N21, N25, N41, N55, N63) by put (s, (s p o)), put (p, (s p o)), and put (o, (s p o)). The nodes responsible for the keys by subject, by predicate, and by object each hold a copy of the triple.


RDFPeers [Cai04]

An RDF repository utilizing a DHT.

We call the DHT for RDFPeers the RDFPeers DHT.
• Key: each of subject, predicate, and object
• Value: triple

Given a query triple:
• Perform a lookup using one of the constants as a key.

Figure: for the query triple (s, p, ?), get (s) or get (p) is issued on the RDFPeers DHT (nodes N4, N8, N21, N25, N41, N55, N63). The lookup reaches the node that holds the triples stored by subject or by predicate (N55 and N21 are highlighted in the figure).
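
The store and lookup steps of RDFPeers described on these two slides can be sketched as follows. This is a single-process stand-in, not the RDFPeers implementation: the class `ToyRDFPeers`, the keying by (role, term), and the use of `None` for variables are assumptions for illustration, and DHT routing is omitted.

```python
from collections import defaultdict


class ToyRDFPeers:
    """Single-process stand-in for the RDFPeers DHT (routing omitted)."""

    def __init__(self):
        self.table = defaultdict(set)  # key -> set of triples

    def store(self, triple):
        # Three puts: one per element of the triple (subject, predicate, object).
        for role, term in zip(("s", "p", "o"), triple):
            self.table[(role, term)].add(triple)

    def query(self, pattern):
        """Answer a query triple; None marks a variable.

        The first constant in the pattern is used as the lookup key, and the
        remaining constants are checked against the returned candidates.
        """
        for role, term in zip(("s", "p", "o"), pattern):
            if term is not None:
                candidates = self.table.get((role, term), set())
                break
        else:
            raise ValueError("pattern needs at least one constant")
        return sorted(
            t for t in candidates
            if all(c is None or c == v for c, v in zip(pattern, t))
        )


peers = ToyRDFPeers()
peers.store(("s1", "p1", "A1"))
peers.store(("s2", "p1", "A2"))
print(peers.query((None, "p1", "A1")))  # [('s1', 'p1', 'A1')]
```

Note that the lookup by predicate returns both stored triples as candidates; the second is discarded only after it has been fetched, which is the transfer cost the index is meant to reduce.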











RDFCube DHT

Key: ID of cell matrix

Value: Bit matrix

To set the bit of a cell to 1 in RDFCube DHT:
• Perform 3 lookups using the IDs of the 3 cell matrices that contain the cell as keys.

Figure: for the cell [1, 2, 1], put ([1, *, *], [1, 2, 1]), put ([*, 2, *], [1, 2, 1]), and put ([*, *, 1], [1, 2, 1]) are issued on the RDFCube DHT (nodes N1, N15, N28, N36, N51, N57). Afterwards the three responsible nodes hold:

   key [1, *, *]   value 0 0 0 0 / 0 0 1 0 / 0 0 0 0 / 0 0 0 0
   key [*, 2, *]   value 0 0 0 0 / 0 0 0 0 / 0 0 1 0 / 0 0 0 0
   key [*, *, 1]   value 0 0 0 0 / 0 1 0 0 / 0 0 0 0 / 0 0 0 0
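
The three puts for one cell can be sketched like this. The class `ToyRDFCube`, the tuple encoding of cell-matrix IDs (with `None` for `*`), and the row/column convention inside each bit matrix are assumptions for illustration; the figure does not fix how the two free axes map to rows and columns. DHT routing is again omitted.

```python
DIVISIONS = 4  # assumed number of divisions per axis


def empty_matrix():
    return [[0] * DIVISIONS for _ in range(DIVISIONS)]


class ToyRDFCube:
    """Single-process stand-in for the RDFCube DHT (routing omitted)."""

    def __init__(self):
        self.table = {}  # cell-matrix ID -> bit matrix

    @staticmethod
    def matrix_ids(cell):
        """IDs of the 3 cell matrices containing `cell`; None stands for '*'."""
        s, p, o = cell
        return [(s, None, None), (None, p, None), (None, None, o)]

    def set_bit(self, cell):
        # One put per cell matrix: three lookups in total.
        for axis, key in enumerate(self.matrix_ids(cell)):
            matrix = self.table.setdefault(key, empty_matrix())
            row, col = [cell[a] for a in range(3) if a != axis]
            matrix[row][col] = 1

    def get(self, key):
        """Return the bit matrix stored under a cell-matrix ID."""
        return self.table.get(key, empty_matrix())


cube = ToyRDFCube()
cube.set_bit((1, 2, 1))
for row in cube.get((1, None, None)):
    print(row)
```

With this convention, the matrix for [1, *, *] is indexed by (predicate, object), so the cell [1, 2, 1] sets the bit at row 2, column 1.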


RDFCube DHT

Key: ID of cell matrix

Value: Bit matrix

To get the bit matrix of a cell matrix:
• Perform a lookup using the cell matrix ID as a key.

Figure: get ([1, *, *]) is issued on the RDFCube DHT (nodes N1, N15, N28, N36, N51, N57). The lookup reaches the node responsible for the key [1, *, *] (N57 in the figure), which returns the bit matrix 0 0 0 0 / 0 0 1 0 / 0 0 0 0 / 0 0 0 0.










Storing Triples

Given a triple (s, p, o):

Update RDFPeers DHT
• Store the triple in RDFPeers DHT by 3 lookups.

Update RDFCube DHT
• Get the cell into which the triple is mapped.
• Set the corresponding bit in each of the 3 bit matrices to 1 by 3 lookups.

Figure: the triple (s, p, o) is hashed to the point (21, 45, 17), which lies in the cell [1, 2, 1].
• RDFPeers DHT (nodes N4, N8, N21, N25, N41, N55, N63) receives put (s, (s p o)), put (p, (s p o)), and put (o, (s p o)).
• RDFCube DHT (nodes N1, N15, N28, N36, N51, N57) receives put ([1, *, *], [1, 2, 1]), put ([*, 2, *], [1, 2, 1]), and put ([*, *, 1], [1, 2, 1]), which set one bit in each of the three bit matrices.











Query Processing (1/2)

Given the query (?x, p1, o1), (?x, p2, o2):

1. Get the bit information of the cells into which the query triples are mapped.
2. Perform an AND operation between the bits.

Figure: the query triple (?x, p1, A1) is mapped to the cell sequence [*, 3, 2], and the query triple (?x, p2, B2) is mapped to the cell sequence [*, 1, 1]. The AND operation combines the two bit sequences 1 1 1 0 and 0 0 1 1 into 0 0 1 0.
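
The AND step can be written in a few lines. This sketch takes the two bit sequences from the figure as plain lists; the function name `and_bits` is hypothetical, and the sequences are assumed to be indexed by the cell position along the axis of the shared variable ?x.

```python
def and_bits(*sequences):
    """Element-wise AND of equally long bit sequences."""
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("bit sequences must have the same length")
    return [int(all(bits)) for bits in zip(*sequences)]


seq_a = [1, 1, 1, 0]  # bit sequence of one query triple (from the figure)
seq_b = [0, 0, 1, 1]  # bit sequence of the other query triple

joined = and_bits(seq_a, seq_b)
print(joined)  # [0, 0, 1, 0]

# Only cells whose bit survives the AND can contain a join result.
surviving_cells = [i for i, bit in enumerate(joined) if bit]
print(surviving_cells)  # [2]
```

A position survives only if both query triples have at least one matching triple in that cell, so a 0 rules the cell out for the join without fetching any triples.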


Query Processing (2/2)

3. Get triples from RDFPeers DHT based on the bit information.
   1. Access the remote node where the candidate answer triples are stored.
   2. For each triple, check whether the bit of the cell into which the triple is mapped is equal to 1.
   3. Return the candidate answer triples that satisfy the condition from the remote node.

Figure: for the query triple (?x, p1, A1), a lookup by predicate p1 on the RDFPeers DHT (nodes N4, N8, N21, N25, N41, N55, N63) reaches the node holding the candidate answers. Filtering based on the bit information narrows down the number of candidate answers:

   candidate answer   cell        bit
   s0 p1 A0           [0, 3, 2]   0   (not an answer)
   s1 p1 A1           [1, 3, 2]   0   (not an answer)
   s2 p1 A1           [2, 3, 2]   1
   s3 p1 A2           [3, 3, 2]   0   (not an answer)
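
The filtering performed on the remote node can be sketched as below. The candidate triples and bits are the ones shown in the figure; the mapping from each subject to its cell index on the subject axis (`subject_cell`) is read off the figure rather than computed from a hash, and the function name and the extra check against the query constants are assumptions for illustration.

```python
def filter_candidates(candidates, subject_cell, joined_bits, pattern):
    """Keep candidates that match the pattern and whose cell bit is 1.

    pattern uses None for variables; subject_cell maps a subject to its
    cell index along the subject axis.
    """
    kept = []
    for triple in candidates:
        matches = all(c is None or c == v for c, v in zip(pattern, triple))
        if matches and joined_bits[subject_cell[triple[0]]] == 1:
            kept.append(triple)
    return kept


candidates = [
    ("s0", "p1", "A0"),
    ("s1", "p1", "A1"),
    ("s2", "p1", "A1"),
    ("s3", "p1", "A2"),
]
subject_cell = {"s0": 0, "s1": 1, "s2": 2, "s3": 3}  # as drawn in the figure
joined_bits = [0, 0, 1, 0]  # result of the AND operation

answers = filter_candidates(candidates, subject_cell, joined_bits, (None, "p1", "A1"))
print(answers)  # [('s2', 'p1', 'A1')]
```

Only one of the four candidates leaves the remote node, which is how the index reduces the amount of data transferred.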


Performance Evaluation

Compare RDFPeers (PEERS) with RDFPeers+RDFCube (CUBE).

Data Set
• Transform XML documents of DBLP into RDF data.
• Create 4 RDF data sets with different numbers of triples (12500, 25000, 50000, 100000).

Environments
• Emulate a 100-node Chord network.
• The number of divisions of RDFCube is 32 to 256.

Queries
• Query 1: (?x, type, Article), (?x, author, “Jim Gray”), (?x, year, “1998”), (?x, journal, “CoRR”)
• Query 2: two triple patterns over ?x and ?y with the predicates series and title and the constant “LNCS”
• Query 3: three triple patterns over ?x, ?y, and ?z with the predicates title, crossref, and title and the constant “VLDB2004”


Storing Performance

• PEERS is the network cost of storing triples.
• CUBE is the network cost of storing triples and constructing the index.
• If the ratio = 2, the cost of index construction equals the cost of storing triples.
• If the ratio = 1, index construction costs nothing.
• The ratio of #hops is smaller than 2: the cost of index construction is smaller than that of storing triples.
• The ratio of transfer data size is very close to 1: the amount of data transferred for index construction is very small.

Figure (left): Ratio (CUBE / PEERS), from 1 to 2, versus number of triples (1-bit ratio): 13761 (0.57%), 26413 (1.01%), 52926 (1.81%), 103076 (3.00%). Series: #hops, Transfer size.

Figure (right): Ratio (CUBE / PEERS), from 1 to 2, versus division number (1-bit ratio): 32 (26.73%), 64 (6.36%), 128 (10.1%), 256 (0.14%). Series: #hops, Transfer size.


Retrieval Performance

• PEERS is the network cost of getting triples from RDFPeers DHT.
• CUBE is the network cost of getting bits and triples from the two DHTs.
• #hops on CUBE is twice that on PEERS: #hops to get triples is equal to #hops to get bit information.
• The transfer data size is reduced to at most 1/50 in query 1. Our approach makes it possible to reduce the transfer size, in particular when the same variable appears many times in the query.

Figure (left): Ratio (CUBE / PEERS), from 0 to 2.5, versus number of triples (1-bit ratio): 13761 (0.57%), 26413 (1.01%), 52926 (1.81%), 103076 (3.00%). Series: Query1, Query2, Query3 (#hops) and Query1, Query2, Query3 (transfer).

Figure (right): Ratio (CUBE / PEERS), from 0 to 2.5, versus division number (1-bit ratio): 32 (26.73%), 64 (6.36%), 128 (10.1%), 256 (0.14%). Series: Query1, Query2, Query3 (#hops) and Query1, Query2, Query3 (transfer).


Scalability

• The ratio of CUBE to PEERS stays constant in all queries: our approach achieves scalability with respect to the number of triples.

Figure: Transfer size ratio (CUBE / PEERS), from 0 to 1, versus #Triples & #Divisions: 13761 & 64, 26413 & 128, 52926 & 256, 103076 & 512. Series: Query1, Query2, Query3.


Summary

What we have achieved.
• Scalability with respect to #triples.
• Reduced amount of data transferred among nodes.

What our major current challenge is.
• Provide efficient RDF query processing with join operations in a distributed environment.

What we will achieve in the near future.
• Eliminate redistribution of triples.
• Utilize the schema information.
• Dynamic division mechanism of RDFCube.


Thank You
