National Institute of Advanced Industrial Science and Technology
Query Processing for Distributed RDF Databases
Using a Three-dimensional Hash Index
Akiyoshi MATONO
[email protected]@aist.go.jp
Grid Technology Research Center, AIST
Agenda
Motivation & Aims
Background: Distributed Hash Table (DHT)
Our approach
Performance evaluation
Summary
Motivation
It is essential to describe resources using RDF in order to support semantic tasks (e.g., resource discovery).
Today, RDF data is widely used in many fields (e.g., bioinformatics and grid computing).
As a result, RDF data is scattered across many sites and its total size is rapidly increasing.
We propose P2P-based RDF query processing.
Providing efficient and scalable RDF query processing in a distributed environment is an important issue.
Aims
RDF data is scattered everywhere.
  Provide an efficient join operation in a distributed environment.
The amount of data is rapidly increasing.
  Reduce the amount of data transferred among nodes.
Achieve scalability, availability, and reliability.
Distributed Hash Table (DHT)
A structured P2P network.
Achieves scalability, availability, and reliability.
Supports only exact-match lookups.
Lookups are for key-value pairs: put (key, value), get (key).
Routing is performed in O(log n) hops, where n is the number of nodes.
Some protocols: Chord, Tapestry, Pastry, CAN, Kademlia.
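Chord [Stoica01]
(Figure: a Chord ring with nodes N2, N6, N11, N17, N27, N33, N42, N48, N50, N56, and N63, and the finger tables of N42, N11, and N27.)

Finger table of N42
distance   succ.   keys
N42+0      N42     42
N42+1      N48     43
N42+2      N48     44-45
N42+4      N48     46-49
N42+8      N50     50-57
N42+16     N63     58-9
N42+32     N11     10-41

Finger table of N11
distance   succ.   keys
N11+0      N11     11
N11+1      N17     12
N11+2      N17     13-14
N11+4      N17     15-18
N11+8      N27     19-26
N11+16     N27     27-42
N11+32     N48     43-10

Finger table of N27
distance   succ.   keys
N27+0      N27     27
N27+1      N33     28
N27+2      N33     29-30
N27+4      N33     31-34
N27+8      N42     35-42
N27+16     N48     43-58
N27+32     N63     59-26

put (28, A) is issued on N42.
In the finger table of N42, the node responsible for key 28 is N11 (keys 10-41), so the request is forwarded to N11.
In the finger table of N11, key 28 falls in the range 27-42, so the request is forwarded to N27.
In the finger table of N27, key 28 belongs to N33. N33 is the target node.
Callouts in the figure: "Key 28" and "The data in this area is stored in Node 27".
The distance to the nodes in a finger table increases exponentially.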
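The routing walk on this slide can be sketched in a few lines of Python. This is only an illustrative simulation, not the Chord implementation: it assumes a 6-bit identifier space (inferred from the finger distances +1 to +32), it computes successors from a global node list instead of exchanging messages, and the names `successor`, `finger_table`, `between`, and `lookup` are hypothetical.

```python
from bisect import bisect_left

M = 6                 # assumed identifier length in bits (IDs 0..63)
RING = 2 ** M
NODES = [2, 6, 11, 17, 27, 33, 42, 48, 50, 56, 63]  # node IDs from the figure


def successor(key):
    """Return the first node whose ID is >= key, wrapping around the ring."""
    i = bisect_left(NODES, key % RING)
    return NODES[i % len(NODES)]


def finger_table(n):
    """Fingers of node n: successor(n + 2**i) for i = 0 .. M-1."""
    return [successor(n + 2 ** i) for i in range(M)]


def between(x, a, b, inclusive_right=False):
    """True if x lies in the ring interval (a, b), or (a, b] if requested."""
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b  # the interval wraps past 0


def lookup(start, key):
    """Route from node `start` to the node responsible for `key`.

    Returns (responsible node, list of nodes visited).
    """
    key %= RING
    n, path = start, [start]
    while n != key:
        succ = finger_table(n)[0]
        if between(key, n, succ, inclusive_right=True):
            path.append(succ)
            return succ, path
        # Forward to the closest finger that precedes the key.
        n = next(f for f in reversed(finger_table(n)) if between(f, n, key))
        path.append(n)
    return n, path


print(finger_table(42))   # fingers of N42
print(lookup(42, 28))     # route for put (28, A) issued on N42
```

Under these assumptions the route for key 28 starting at N42 visits N11 and N27 before reaching N33, which matches the walk shown on the slide.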
Our Approach
Three-dimensional hash space called “RDFCube”
  Each axis represents the hash space for one of subject, predicate, and object.
  Consists of a set of cubes of the same size called “cells”.
Bit information of RDFCube called “existence flag”
  Each cell contains a bit that indicates the presence or absence of triples mapped into the cell.
Runs on top of two DHTs.
  RDFPeers DHT is used to store triples.
  RDFCube DHT is used to store bit information.
RDFCube: three-dimensional hash space
Each axis represents the hash space for one of the triple’s elements (subject, predicate, and object).
RDFCube is composed of a set of cubes of the same size called “cells”.
A triple is mapped into RDFCube based on the hash values of its elements.
(Figure: a triple (s, p, o) is hashed element by element to a point in the cube whose axes are subject, predicate, and object.)
This triple is mapped into the point (13, 54, 39). The point is contained in the cell [0, 3, 2].
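The mapping from a triple to a cell can be sketched as follows. The axis size of 64 and the 4 divisions per axis are assumptions: they are not stated on the slide, but they reproduce its example, where the point (13, 54, 39) falls in cell [0, 3, 2]. The hash function `axis_hash` (SHA-1 reduced modulo the axis size) is likewise a hypothetical stand-in for whatever hash the system uses.

```python
import hashlib

SPACE = 64       # assumed size of the hash space on each axis
DIVISIONS = 4    # assumed number of cells along each axis


def axis_hash(term, space=SPACE):
    """Hypothetical hash of one triple element onto one axis."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % space


def cell_of(point, space=SPACE, divisions=DIVISIONS):
    """Return the cell index [cs, cp, co] that contains the given point."""
    width = space // divisions          # edge length of one cell
    return tuple(v // width for v in point)


def cell_of_triple(triple):
    """Hash each element of (s, p, o) and return (point, cell)."""
    point = tuple(axis_hash(term) for term in triple)
    return point, cell_of(point)


print(cell_of((13, 54, 39)))   # the point from the slide
```

With a cell width of 16, the point (13, 54, 39) has integer quotients 0, 3, and 2, which is the cell named on the slide.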
Existence Flag
Each cell contains a bit that indicates the presence or absence of triples mapped into the cell.
(Figure labels: the triple (s, p, o) is mapped into cell [0, 3, 2], whose existence flag is 1. The cell sequence [0, 1, *] has the bit sequence 0 1 1 0. The cell matrix [0, *, *] has the bit matrix shown below.)
0 0 0 0
0 1 1 0
0 0 0 0
0 1 0 0
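One way to picture existence flags, cell sequences, and cell matrices is the small sketch below. It is an assumption-laden illustration: the class name `ExistenceFlags`, the 4 divisions per axis, and the layout of a bit matrix (rows indexed by predicate cell, columns by object cell) are choices made here, so the printed matrix may be mirrored relative to the figure.

```python
DIVISIONS = 4  # assumed number of cells along each axis


class ExistenceFlags:
    """Keeps one bit per cell: 1 if at least one triple maps into it."""

    def __init__(self, divisions=DIVISIONS):
        self.divisions = divisions
        self.ones = set()               # cells whose flag is 1

    def mark(self, cell):
        """Set the existence flag of `cell` to 1."""
        self.ones.add(tuple(cell))

    def bit(self, cell):
        """Existence flag of a single cell."""
        return 1 if tuple(cell) in self.ones else 0

    def bit_sequence(self, s, p):
        """Bits of the cell sequence [s, p, *]."""
        return [self.bit((s, p, o)) for o in range(self.divisions)]

    def bit_matrix(self, s):
        """Bits of the cell matrix [s, *, *], one row per predicate cell."""
        return [self.bit_sequence(s, p) for p in range(self.divisions)]


flags = ExistenceFlags()
for cell in [(0, 3, 2), (0, 1, 1), (0, 1, 2)]:   # example cells, chosen here
    flags.mark(cell)

print(flags.bit((0, 3, 2)))
print(flags.bit_sequence(0, 1))
for row in flags.bit_matrix(0):
    print(row)
```

A set of marked cells is used instead of a dense array because the 1-bit ratio reported later in the talk is small, so most flags are 0.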
Two DHTs: RDFCube & RDFPeers
RDFPeers DHT is used to store RDF triples.
  RDFPeers is an RDF repository utilizing a DHT.
  Proposed by [Min Cai and Martin Frank, 2004].
RDFCube DHT is used to store bit information.
  Used as an index for RDFPeers.
Storing triples.
  1. Store the triples into RDFPeers DHT.
  2. Store the bit information of the triples into RDFCube DHT.
Query processing with join operation.
  1. Get the bit information from RDFCube DHT.
  2. Perform AND operations on the bits.
  3. Get triples from RDFPeers DHT based on the bit information.
RDFPeers [Cai04]
An RDF repository utilizing a DHT.
We call the DHT used by RDFPeers the RDFPeers DHT.
  Key: each of subject, predicate, and object
  Value: triple
To store a triple into RDFPeers DHT:
  The triple is stored 3 times on 3 nodes by 3 lookups, using the triple’s elements as keys.
(Figure: RDFPeers DHT ring with nodes N4, N8, N21, N25, N41, N55, and N63. The triple (s, p, o) is stored by put (s, (s, p, o)), put (p, (s, p, o)), and put (o, (s, p, o)), that is, once by subject, once by predicate, and once by object.)
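The triple-replication scheme described above can be sketched with a toy in-memory ring. This is not the RDFPeers code: the class `ToyRDFPeers`, the 64-slot identifier space, the SHA-1 based `term_id`, and the `match` helper are all hypothetical, and a real deployment would route each put and get through DHT lookups instead of a local dictionary.

```python
import hashlib
from bisect import bisect_left

RING = 64                               # assumed identifier space
NODES = [4, 8, 21, 25, 41, 55, 63]      # node IDs from the figure


def term_id(term):
    """Hypothetical hash of a triple element onto the ring."""
    return int(hashlib.sha1(term.encode("utf-8")).hexdigest(), 16) % RING


def responsible_node(key_id):
    """First node whose ID is >= key_id, wrapping around the ring."""
    i = bisect_left(NODES, key_id)
    return NODES[i % len(NODES)]


class ToyRDFPeers:
    def __init__(self):
        self.storage = {}               # node ID -> {key -> set of triples}

    def put(self, key, triple):
        node = responsible_node(term_id(key))
        self.storage.setdefault(node, {}).setdefault(key, set()).add(triple)
        return node

    def get(self, key):
        node = responsible_node(term_id(key))
        return set(self.storage.get(node, {}).get(key, set()))

    def store_triple(self, triple):
        """Store the triple three times, keyed by s, by p, and by o."""
        return [self.put(term, triple) for term in triple]

    def match(self, pattern):
        """Answer a pattern such as ('s1', 'p1', None); None is a variable.

        Needs at least one constant, which is used as the lookup key.
        """
        key = next(term for term in pattern if term is not None)
        return {t for t in self.get(key)
                if all(q is None or q == v for q, v in zip(pattern, t))}


peers = ToyRDFPeers()
peers.store_triple(("s1", "p1", "o1"))
peers.store_triple(("s2", "p1", "o2"))
print(peers.match((None, "p1", None)))
```

Storing each triple under all three elements is what allows a query with any single constant to be answered by one exact-match lookup.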
RDFPeers [Cai04]
Given a query triple:
  Perform a lookup using one of the constants as a key.
(Figure: for the query triple (s, p, ?), whose object is unknown, the lookup is get (s) or get (p). The request is routed to the node that stores the triples under that key; N55 and N21 are the nodes highlighted in the figure.)
RDFCube DHT
Key: ID of cell matrix
Value: bit matrix
To set the bit of a cell to 1 in RDFCube DHT:
  Perform 3 lookups using the IDs of the 3 cell matrices containing the cell as keys.
(Figure: RDFCube DHT ring with nodes N1, N15, N28, N36, N51, and N57. For cell [1, 2, 1], the operations are put ([1, *, *], [1, 2, 1]), put ([*, 2, *], [1, 2, 1]), and put ([*, *, 1], [1, 2, 1]). The resulting entries are listed below.)
key: [1, *, *]
value:
0 0 0 0
0 0 1 0
0 0 0 0
0 0 0 0
key: [*, 2, *]
value:
0 0 0 0
0 0 0 0
0 0 1 0
0 0 0 0
key: [*, *, 1]
value:
0 0 0 0
0 1 0 0
0 0 0 0
0 0 0 0
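The three updates per cell can be sketched as below. This is a simplified model, not the RDFCube implementation: a plain dictionary stands in for the DHT, the names `cell_matrix_keys` and `ToyRDFCube` are hypothetical, and the position of a bit inside a matrix (row and column taken from the two free coordinates in s, p, o order) is an assumption that need not match the orientation drawn in the figure.

```python
DIVISIONS = 4  # assumed number of cells along each axis


def cell_matrix_keys(cell):
    """IDs of the three cell matrices that contain the given cell."""
    s, p, o = cell
    return [(s, "*", "*"), ("*", p, "*"), ("*", "*", o)]


class ToyRDFCube:
    def __init__(self, divisions=DIVISIONS):
        self.divisions = divisions
        self.store = {}                 # cell-matrix ID -> bit matrix

    def _empty(self):
        d = self.divisions
        return [[0] * d for _ in range(d)]

    def set_bit(self, cell):
        """Set the bit of `cell` in each of its three cell matrices."""
        for key in cell_matrix_keys(cell):
            matrix = self.store.setdefault(key, self._empty())
            # The two coordinates not fixed by the key locate the bit.
            row, col = [c for c, k in zip(cell, key) if k == "*"]
            matrix[row][col] = 1

    def get(self, key):
        """Bit matrix stored under a cell-matrix ID (all zeros if absent)."""
        return self.store.get(key, self._empty())


cube = ToyRDFCube()
cube.set_bit((1, 2, 1))
for key in cell_matrix_keys((1, 2, 1)):
    print(key, cube.get(key))
```

In the real system each of the three updates would be one DHT lookup, so setting one bit costs three lookups, the same number as storing the triple itself in RDFPeers.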
RDFCube DHT
Key: ID of cell matrix
Value: bit matrix
To get the bit matrix of a cell matrix:
  Perform a lookup using the cell matrix ID as a key.
(Figure: get ([1, *, *]) is routed to the responsible node, N57 in the figure, which returns the bit matrix stored under the key [1, *, *].)
0 0 0 0
0 0 1 0
0 0 0 0
0 0 0 0
Storing Triples
Given the triple (s, p, o):
Update RDFPeers DHT
  Store the triple into RDFPeers DHT by 3 lookups.
Update RDFCube DHT
  Get the cell into which the triple is mapped.
  Set the corresponding bit in each of the 3 bit matrices to 1 by 3 lookups.
(Figure: the triple (s, p, o) is hashed to (21, 45, 17), which lies in cell [1, 2, 1]. On the RDFPeers DHT the triple is stored by put (s, (s, p, o)), put (p, (s, p, o)), and put (o, (s, p, o)). On the RDFCube DHT the bit is set by put ([1, *, *], [1, 2, 1]), put ([*, 2, *], [1, 2, 1]), and put ([*, *, 1], [1, 2, 1]).)
Query Processing (1/2)
Given the query:
  1. Get the bit information of the cells into which the query triples are mapped.
  2. Perform an AND operation between the bits.
(Figure: a query with two triple patterns that share the variable ?x, with predicates p1 and p2 and objects o1 and o2, shown as A1 and B2 in the second part of the figure. The patterns map to the cell sequences [*, 3, 2] (for p1) and [*, 1, 1]. The two bit sequences 1 1 1 0 and 0 0 1 1 are combined by the AND operation into 0 0 1 0.)
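The AND step can be written out directly. The two input sequences are the ones shown on the slide; the function name `and_bits` and the variable names are hypothetical, and which sequence belongs to which triple pattern is not recoverable from the transcript, so they are simply called `seq_a` and `seq_b`.

```python
def and_bits(a, b):
    """Element-wise AND of two bit sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("bit sequences must have the same length")
    return [x & y for x, y in zip(a, b)]


seq_a = [1, 1, 1, 0]    # bits of one cell sequence along the ?x (subject) axis
seq_b = [0, 0, 1, 1]    # bits of the other cell sequence

joined = and_bits(seq_a, seq_b)
# Subject cells that can still contain a value of ?x satisfying both patterns.
surviving = [i for i, bit in enumerate(joined) if bit]

print(joined)
print(surviving)
```

Only positions where both sequences hold a 1 survive, so any subject cell that is empty for either pattern is excluded before triples are fetched.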
Query Processing (2/2)
3. Get triples from RDFPeers DHT based on the bit information.
  1. Access the remote node where the candidate answer triples are stored.
  2. For each triple, check whether the bit of the cell into which the triple is mapped is equal to 1.
  3. Return the candidate answer triples that satisfy the condition from the remote node.
(Figure: filtering based on the bit information narrows down the number of candidate answers for the pattern (?x, p1, A1).)
candidate answer   cell        bit
s0 p1 A0           [0, 3, 2]   0    not an answer
s1 p1 A1           [1, 3, 2]   0    not an answer
s2 p1 A1           [2, 3, 2]   1
s3 p1 A2           [3, 3, 2]   0    not an answer
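The filtering performed on the remote node can be sketched as follows. The candidate triples, cells, and bits are taken from the slide, with each triple paired to a cell in the order they appear there, which is an assumption about the figure layout. The function name `filter_candidates` and the lookup tables are hypothetical.

```python
# Candidate answers held by the remote node, each with the cell it maps into.
candidate_cells = {
    ("s0", "p1", "A0"): (0, 3, 2),
    ("s1", "p1", "A1"): (1, 3, 2),
    ("s2", "p1", "A1"): (2, 3, 2),
    ("s3", "p1", "A2"): (3, 3, 2),
}

# Result of the AND operation, expressed per cell.
anded_bits = {(0, 3, 2): 0, (1, 3, 2): 0, (2, 3, 2): 1, (3, 3, 2): 0}


def filter_candidates(cells_by_triple, bits):
    """Keep only the triples whose cell still has its bit set to 1."""
    return [triple for triple, cell in cells_by_triple.items()
            if bits.get(cell, 0) == 1]


answers = filter_candidates(candidate_cells, anded_bits)
print(answers)
```

Because the check runs where the candidates are stored, only the surviving triples cross the network. Since several triples can share one cell, a surviving triple is a candidate rather than a guaranteed join result.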
Performance Evaluation
Compare RDFPeers (PEERS) with RDFPeers+RDFCube (CUBE).
Data set
  Transform the XML documents of DBLP into RDF data.
  Create 4 RDF data sets with different numbers of triples (12500, 25000, 50000, 100000).
Environment
  Emulate a 100-node Chord network.
  The number of divisions of RDFCube is 32 to 256.
Queries
(Figure: three query graphs, listed here by their labels.)
  Query 1: nodes ?x, Article, “Jim Gray”, “1998”, “CoRR”; edges type, author, year, journal.
  Query 2: nodes ?x, ?y, “LNCS”; edges title, series.
  Query 3: nodes ?x, ?y, ?z, “VLDB2004”; edges title, crossref, title.
Storing Performance
• PEERS is the network cost of storing triples.
• CUBE is the network cost of storing triples and constructing the index.
• If the ratio is 2, the cost of index construction equals the cost of storing triples.
• If the ratio is 1, index construction costs nothing.
• The ratio of #hops is smaller than 2: the cost of index construction is smaller than that of storing triples.
• The ratio of transfer data size is very close to 1: the amount of data transferred for index construction is very small.
(Figure: two charts. Y-axis in both: ratio (CUBE / PEERS), from 1 to 2. Series: #hops and transfer size. X-axis of the first chart: number of triples (1-bit ratio): 13761 (0.57%), 26413 (1.01%), 52926 (1.81%), 103076 (3.00%). X-axis of the second chart: division number (1-bit ratio): 32 (26.73%), 64 (6.36%), 128 (10.1%), 256 (0.14%).)
Retrieval Performance
• PEERS is the network cost of getting triples from RDFPeers DHT.
• CUBE is the network cost of getting bits and triples from the two DHTs.
• #hops on CUBE is twice that on PEERS: #hops to get triples is equal to #hops to get bit information.
• The transfer data size is reduced to at most 1/50 in query 1. Our approach makes it possible to reduce the transfer size, in particular when the same variable appears many times in the query.
(Figure: two charts. Y-axis in both: ratio (CUBE / PEERS), from 0 to 2.5. Series: #hops and transfer size for each of Query 1, Query 2, and Query 3. X-axis of the first chart: number of triples (1-bit ratio): 13761 (0.57%), 26413 (1.01%), 52926 (1.81%), 103076 (3.00%). X-axis of the second chart: division number (1-bit ratio): 32 (26.73%), 64 (6.36%), 128 (10.1%), 256 (0.14%).)
Scalability
• The ratio of CUBE to PEERS stays constant in all queries: our approach achieves scalability with respect to the number of triples.
(Figure: transfer size ratio (CUBE / PEERS), from 0 to 1, for Query 1, Query 2, and Query 3. X-axis: #triples & #divisions: 13761 & 64, 26413 & 128, 52926 & 256, 103076 & 512.)
Summary
What we have achieved.
  Scalability with respect to #triples.
  Reduction of the amount of data transferred among nodes.
What our major current challenges are.
  Provide efficient RDF query processing with join operations in a distributed environment.
What we will achieve in the near future.
  Eliminate redistribution of triples.
  Utilize the schema information.
  Dynamic division mechanism of RDFCube.
Thank You