exchange and consumption of huge rdf data
Post on 29-Aug-2014
Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Exchange and Consumption of Huge RDF Data
Miguel A. Martínez-Prieto (1,2) <migumar2@infor.uva.es>
Mario Arias (1,3) <mario.arias@deri.org>
Javier D. Fernández (1,2) <jfergar@infor.uva.es>
1. Department of Computer Science, Universidad de Valladolid (Spain)
2. Department of Computer Science, Universidad de Chile (Chile)
3. Digital Enterprise Research Institute, National University of Ireland Galway

Sharing RDF in the Web of Data
[Diagram: 1. Publication (data generation by sensors and applications); 2. Exchange (RDF dumps, SPARQL endpoints / APIs, dereferenceable URIs); 3. Consumption (parsing / indexing, reasoning) by applications.]
• Dataset analysis.
• Set up a SPARQL server.
• Vocabulary interlinking / integration.
• Browsing and visualization.
• Exchange between servers.
• Data-intensive tasks.

Dataset Exchange Workflow
1º Publication: Convert, Serialize, Compress
2º Exchange: Transfer
3º Consumption: Decompress, Parse, Index
If RDF is meant to be machine processable, why are we using plain-text serialization formats?
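The plain-text workflow can be sketched end to end as follows. This is a minimal illustration, not the authors' tooling: the example.org IRIs are placeholders, the parser only handles IRI-only statements, and the "index" is a plain dictionary keyed by subject.

```python
import gzip

# Hypothetical toy dataset; the example.org IRIs are placeholders.
triples = [
    ("<http://example.org/s1>", "<http://example.org/p1>", "<http://example.org/o1>"),
    ("<http://example.org/s1>", "<http://example.org/p2>", "<http://example.org/o2>"),
    ("<http://example.org/s2>", "<http://example.org/p1>", "<http://example.org/o1>"),
]


def serialize(triples):
    """Publication: write one N-Triples statement per line."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


def publish(triples):
    """Publication: serialize, then compress."""
    return gzip.compress(serialize(triples).encode("utf-8"))


def consume(payload):
    """Consumption: decompress, parse, and index by subject."""
    text = gzip.decompress(payload).decode("utf-8")
    index = {}
    for line in text.splitlines():
        # Naive parse: valid only for IRI-only statements without spaces.
        s, p, o = line.rstrip(" .").split(" ")
        index.setdefault(s, []).append((p, o))
    return index


payload = publish(triples)  # this is what travels over the network
index = consume(payload)
print(len(payload), "bytes transferred;", len(index), "subjects indexed")
```

Every consumer repeats the decompress, parse and index steps on its own, which is the cost the rest of the talk tries to remove.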

HDT: RDF Binary Format
• Compact data structure for RDF.
• W3C Submission: http://www.w3.org/Submission/2011/03/
• Open-source C++/Java library.

HDT Focused on Querying (HDT-FoQ)
Contribution of this paper:
• A complementary index to make the HDT fully queryable.
• Analysis of how HDT reduces exchange and indexing time.
• Evaluation of in-memory query performance.

Dictionary
• Mapping of strings to consecutive IDs {1..n}.
• Lexicographically sorted, no duplicates.
• Section compression explained in [8].
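A minimal sketch of such a dictionary, assuming a single sorted list of terms (the class and method names are hypothetical, and the section compression of [8] is not modeled):

```python
from bisect import bisect_left


class Dictionary:
    """Sorted, duplicate-free term list; IDs are 1-based list positions."""

    def __init__(self, terms):
        self.terms = sorted(set(terms))

    def string_to_id(self, term):
        # Binary search works because the terms are lexicographically sorted.
        i = bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return i + 1
        return 0  # assumption: 0 means "term not in the dictionary"

    def id_to_string(self, ident):
        return self.terms[ident - 1]


d = Dictionary(["ex:name", "ex:alice", "ex:knows", "ex:alice"])
print(d.string_to_id("ex:knows"), d.id_to_string(1))
```

Because the list is sorted, string-to-ID lookup is a binary search and ID-to-string lookup is a direct array access.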

Triples Model
Triples (S P O):
1 2 6
1 3 2
2 1 3
2 2 4
2 2 5
2 4 1
3 3 2
Tree representation:
S: 1 2 3
P: [2 3] [1 2 4] [3]
O: [6] [2] [3] [4 5] [1] [2]

Adjacency Lists
Operations:
– access(g) = Given a global position, get the value. O(1)
– findList(g) = Given a global position, get the list number. O(1)
– first(l), last(l) = Given a list, find its first and last positions. O(log log n)
Example:
Lists:    [2, 3] [1, 2, 4] [3]   (lists 1, 2, 3)
Position: 1 2 3 4 5 6
Array:    2 3 1 2 4 3
Bitmap:   1 0 1 0 0 1
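The three operations can be sketched over the array and bitmap of the example. This is an illustration only: it follows the bitmap as drawn on the slide (a 1 marks the first element of each list), and it uses linear scans where a succinct bitmap would answer rank and select within the stated bounds.

```python
class AdjacencyList:
    """Array + bitmap encoding of consecutive lists; positions are 1-based."""

    def __init__(self, array, bitmap):
        self.array = array
        self.bitmap = bitmap  # 1 marks the first element of each list

    def access(self, g):
        return self.array[g - 1]

    def find_list(self, g):
        # rank1(g): number of lists opened up to position g.
        return sum(self.bitmap[:g])

    def first(self, l):
        # select1(l): position of the l-th 1 bit.
        seen = 0
        for pos, bit in enumerate(self.bitmap, start=1):
            seen += bit
            if bit and seen == l:
                return pos
        raise IndexError(f"no list {l}")

    def last(self, l):
        if l == sum(self.bitmap):  # the final list ends with the array
            return len(self.array)
        return self.first(l + 1) - 1


adj = AdjacencyList([2, 3, 1, 2, 4, 3], [1, 0, 1, 0, 0, 1])
print(adj.access(4), adj.find_list(4), adj.first(2), adj.last(2))
```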

Triples Model and Coding
Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2
Tree representation:
S: 1 2 3
P: [2 3] [1 2 4] [3]
O: [6] [2] [3] [4 5] [1] [2]
Coding:
Array Y (predicates): 2 3 1 2 4 3
Bitmap Y:             1 0 1 0 0 1
Array Z (objects):    6 2 3 4 5 1 2
Bitmap Z:             1 1 1 1 0 1 1
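The coding step can be sketched as a single pass over the sorted triples. The function name `encode` is hypothetical; the sketch assumes subject IDs are consecutive starting at 1 (so the subject level stays implicit) and uses the slide's convention that a 1 bit opens a new list.

```python
def encode(triples):
    """Build Array Y / Bitmap Y / Array Z / Bitmap Z from (s, p, o) ID triples."""
    array_y, bitmap_y, array_z, bitmap_z = [], [], [], []
    prev_s = prev_p = None
    for s, p, o in sorted(triples):  # SPO order
        new_s = s != prev_s
        new_p = new_s or p != prev_p
        if new_p:
            array_y.append(p)
            bitmap_y.append(1 if new_s else 0)  # 1 = first predicate of a subject
        array_z.append(o)
        bitmap_z.append(1 if new_p else 0)  # 1 = first object of a (s, p) pair
        prev_s, prev_p = s, p
    return array_y, bitmap_y, array_z, bitmap_z


TRIPLES = [(1, 2, 6), (1, 3, 2), (2, 1, 3), (2, 2, 4), (2, 2, 5), (2, 4, 1), (3, 3, 2)]
array_y, bitmap_y, array_z, bitmap_z = encode(TRIPLES)
print(array_y, bitmap_y, array_z, bitmap_z)
```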

Searching by Subject
[Figure: Array Y / Bitmap Y and Array Z / Bitmap Z for the example triples.]
Patterns: SPO, SP?, S??, S?O
Example: ( 2, 2, ? )
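A sketch of how the example pattern ( 2, 2, ? ) could be resolved over the example arrays. The helper names are hypothetical and the list boundaries are found by scanning the bitmap, where a real implementation would use select.

```python
ARRAY_Y = [2, 3, 1, 2, 4, 3]
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]


def list_range(bitmap, l):
    """1-based (first, last) positions of the l-th list; a 1 bit opens a list."""
    starts = [pos for pos, bit in enumerate(bitmap, start=1) if bit]
    first = starts[l - 1]
    last = starts[l] - 1 if l < len(starts) else len(bitmap)
    return first, last


def search_sp(s, p):
    """Resolve (s, p, ?): return the objects of subject s and predicate p."""
    first, last = list_range(BITMAP_Y, s)  # predicates of subject s
    for y in range(first, last + 1):  # the list is sorted, so binary search also works
        if ARRAY_Y[y - 1] == p:
            z_first, z_last = list_range(BITMAP_Z, y)  # objects of (s, p)
            return ARRAY_Z[z_first - 1:z_last]
    return []


print(search_sp(2, 2))
```

For the example, subject 2 owns positions 3 to 5 of Array Y, predicate 2 sits at position 4, and the fourth list of Array Z holds the matching objects.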

Searching by Predicate
[Figure: Array Y / Bitmap Y and Array Z / Bitmap Z for the example triples.]
Pattern: ?P?
Example: ( ?, 2, ? )

Wavelet Tree
Compact sequence of integers {0..σ}.
– access(position) = Value at position. O(log σ)
– rank(entry, position) = Number of appearances of “entry” up to “position”. O(log σ)
– select(entry, i) = Position where “entry” appears for the i-th time. O(log σ)
Example:
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Sequence: 2 3 6 3 6 1 2 3 6 2  5  2  4  1  4  2
rank(3, 7) = 2
select(6, 3) = 9
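A small pointer-based wavelet tree illustrates the three operations on the example sequence. This is a teaching sketch, not the library's implementation: bitmaps are plain Python lists, rank on a bitmap uses a prefix-count table, and select on a bitmap is a linear scan, so the O(log σ) bounds only hold once those are replaced by succinct bitmaps.

```python
class WaveletTree:
    """Wavelet tree over integers in [lo, hi]; positions are 1-based."""

    def __init__(self, seq, lo=None, hi=None):
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.n = len(seq)
        self.left = self.right = None
        if self.lo == self.hi or not seq:
            return  # leaf, or empty inner node
        mid = (self.lo + self.hi) // 2
        # Bit i is 1 when the i-th symbol belongs to the right child.
        self.bits = [1 if v > mid else 0 for v in seq]
        self.ones = [0]  # ones[i] = number of 1 bits among the first i bits
        for b in self.bits:
            self.ones.append(self.ones[-1] + b)
        self.left = WaveletTree([v for v in seq if v <= mid], self.lo, mid)
        self.right = WaveletTree([v for v in seq if v > mid], mid + 1, self.hi)

    def access(self, pos):
        if self.lo == self.hi:
            return self.lo
        if self.bits[pos - 1]:
            return self.right.access(self.ones[pos])
        return self.left.access(pos - self.ones[pos])

    def rank(self, entry, pos):
        if not self.lo <= entry <= self.hi or pos <= 0 or self.n == 0:
            return 0
        pos = min(pos, self.n)
        if self.lo == self.hi:
            return pos
        if entry <= (self.lo + self.hi) // 2:
            return self.left.rank(entry, pos - self.ones[pos])
        return self.right.rank(entry, self.ones[pos])

    def select(self, entry, i):
        if not self.lo <= entry <= self.hi or self.n == 0:
            return None
        if self.lo == self.hi:
            return i if 1 <= i <= self.n else None
        go_right = entry > (self.lo + self.hi) // 2
        child = self.right if go_right else self.left
        p = child.select(entry, i)
        if p is None:
            return None
        seen = 0  # map the child position back: find the p-th matching bit
        for idx, b in enumerate(self.bits, start=1):
            if b == go_right:
                seen += 1
                if seen == p:
                    return idx
        return None


SEQ = [2, 3, 6, 3, 6, 1, 2, 3, 6, 2, 5, 2, 4, 1, 4, 2]
wt = WaveletTree(SEQ)
print(wt.rank(3, 7), wt.select(6, 3), wt.access(6))
```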

Searching by Predicate w/ Wavelet
[Figure: the same structure with Array Y stored as a wavelet tree (Wavelet Y); Bitmap Y, Array Z and Bitmap Z are unchanged.]
Pattern: ?P?
Example: ( ?, 2, ? )
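A sketch of how ( ?, 2, ? ) could be resolved with select on the predicate sequence. The helper names are hypothetical, and a plain list scan stands in for the wavelet tree's select; the point is the access pattern, not the cost.

```python
ARRAY_Y = [2, 3, 1, 2, 4, 3]
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]


def select(seq, value, i):
    """1-based position of the i-th occurrence of value (stand-in for Wavelet Y)."""
    seen = 0
    for pos, v in enumerate(seq, start=1):
        if v == value:
            seen += 1
            if seen == i:
                return pos
    return None


def rank1(bitmap, pos):
    """Number of 1 bits among the first pos bits."""
    return sum(bitmap[:pos])


def list_range(bitmap, l):
    """1-based (first, last) positions of the l-th list; a 1 bit opens a list."""
    starts = [pos for pos, bit in enumerate(bitmap, start=1) if bit]
    return starts[l - 1], (starts[l] - 1 if l < len(starts) else len(bitmap))


def search_p(p):
    """Resolve (?, p, ?): jump to each occurrence of p instead of scanning Y."""
    results = []
    i = 1
    y = select(ARRAY_Y, p, i)
    while y is not None:
        s = rank1(BITMAP_Y, y)  # subject that owns position y
        z_first, z_last = list_range(BITMAP_Z, y)
        results.extend((s, p, o) for o in ARRAY_Z[z_first - 1:z_last])
        i += 1
        y = select(ARRAY_Y, p, i)
    return results


print(search_p(2))
```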

Triples: Object-Search
Array Z: 6 2 3 4 5 1 2
OP-Index (positions in Array Z, grouped by object):
O1: [6]   O2: [2 7]   O3: [3]   O4: [4]   O5: [5]   O6: [1]
Patterns: ??O, ?PO
Example: ( ?, ?, 2 )
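The OP-Index of the example can be rebuilt with one pass over Array Z, and ( ?, ?, 2 ) then resolves by walking back up through the two bitmaps. The function names are hypothetical, and positions inside each object list are simply kept in increasing order here, which is an assumption of this sketch.

```python
ARRAY_Y = [2, 3, 1, 2, 4, 3]
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]


def build_op_index(array_z):
    """For each object ID, list the 1-based positions where it occurs in Array Z."""
    index = [[] for _ in range(max(array_z))]
    for z, o in enumerate(array_z, start=1):
        index[o - 1].append(z)
    return index


def rank1(bitmap, pos):
    """Number of 1 bits among the first pos bits."""
    return sum(bitmap[:pos])


def search_o(op_index, o):
    """Resolve (?, ?, o) by mapping each Z position back to its predicate and subject."""
    results = []
    for z in op_index[o - 1]:
        y = rank1(BITMAP_Z, z)  # position in Array Y that owns Z position z
        s = rank1(BITMAP_Y, y)  # subject that owns Y position y
        results.append((s, ARRAY_Y[y - 1], o))
    return results


op_index = build_op_index(ARRAY_Z)
print(op_index)
print(search_o(op_index, 2))
```

Filtering the result of `search_o` on the predicate gives a simple way to answer ?PO with the same index.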

Data Structure Summary
From HDT to HDT-FoQ:
• Convert Array Y to Wavelet.
• Generate OP-Index.
Triple Patterns:
• SPO, SP?, S??, S?O: Original HDT
• ?P?: Wavelet Tree
• ?PO, ??O: OP-Index
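The summary amounts to a small dispatch rule on which positions of the pattern are bound. A sketch, where `None` marks an unbound position; the function name is hypothetical, and the fully unbound pattern ??? is not covered by the slide, so its branch is an assumption.

```python
def structure_for(s, p, o):
    """Name the structure that resolves a triple pattern (None = unbound)."""
    if s is not None:
        return "Original HDT"  # SPO, SP?, S??, S?O
    if o is not None:
        return "OP-Index"  # ?PO, ??O
    if p is not None:
        return "Wavelet Tree"  # ?P?
    return "Sequential scan"  # ???: assumption, not listed on the slide


print(structure_for(2, 2, None), "|", structure_for(None, 2, None), "|", structure_for(None, None, 2))
```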

Evaluation Environment
Datasets:
Dataset     Triples   Size (N-Triples)
LinkedMDB   6,1M      850 MB
DBLP        73M       11,1 GB
Geonames    112M      12,3 GB
DBPedia     258M      37,3 GB
Producer: Xeon @ 2.4 GHz, 96 GB RAM
Consumer: Phenom-II @ 3.2 GHz, 8 GB RAM
RDF Storage:
• Virtuoso
• RDF-3x
• Hexastore
Compressors:
• GZIP
• LZMA

Compression Ratio
[Chart: compression ratio (% against plain N-Triples, axis 0 to 14) for LinkedMDB, DBLP, Geonames and DBPedia. Series: hdt, gz, lzma, hdt.gz, hdt.lzma.]

Publication Times
Dataset     NT+GZIP    NT+LZMA    HDT        HDT+GZIP   HDT+LZMA
linkedMDB   11,3 sec   14,7 min   1,05 min   1,09 min   1,52 min
DBLP        2,72 min   103 min    12 min     13,5 min   21,9 min
Geonames    3,28 min   244 min    25 min     26,4 min   38,9 min
DBPedia     15,9 min   466 min    56 min     60 min     121 min
[Chart: times slower than N-Triples + GZIP (axis 0 to 80) for linkedMDB, dblp, geonames and dbpedia. Series: gz, lzma, hdt, hdt.gz, hdt.lzma.]
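The chart's metric can be recomputed from the table by dividing each time by the NT+GZIP time of the same dataset. A short sketch, with the table values converted to seconds (the dictionary layout and the `slowdown` name are just for illustration):

```python
MIN = 60.0  # seconds per minute

# Publication times from the table, in seconds.
TIMES = {
    "linkedMDB": {"NT+GZIP": 11.3, "NT+LZMA": 14.7 * MIN, "HDT": 1.05 * MIN,
                  "HDT+GZIP": 1.09 * MIN, "HDT+LZMA": 1.52 * MIN},
    "DBLP": {"NT+GZIP": 2.72 * MIN, "NT+LZMA": 103 * MIN, "HDT": 12 * MIN,
             "HDT+GZIP": 13.5 * MIN, "HDT+LZMA": 21.9 * MIN},
    "Geonames": {"NT+GZIP": 3.28 * MIN, "NT+LZMA": 244 * MIN, "HDT": 25 * MIN,
                 "HDT+GZIP": 26.4 * MIN, "HDT+LZMA": 38.9 * MIN},
    "DBPedia": {"NT+GZIP": 15.9 * MIN, "NT+LZMA": 466 * MIN, "HDT": 56 * MIN,
                "HDT+GZIP": 60 * MIN, "HDT+LZMA": 121 * MIN},
}


def slowdown(dataset, fmt):
    """Times slower than N-Triples + GZIP for the same dataset."""
    return TIMES[dataset][fmt] / TIMES[dataset]["NT+GZIP"]


for dataset, row in TIMES.items():
    ratios = ", ".join(f"{fmt}: {slowdown(dataset, fmt):.1f}x" for fmt in row if fmt != "NT+GZIP")
    print(f"{dataset}: {ratios}")
```

On these numbers NT+LZMA is tens of times slower than NT+GZIP on every dataset, while plain HDT stays within a single-digit factor.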

Publication Times (2)
[Chart: the same publication times without NT+LZMA. Times slower than N-Triples + GZIP (axis 0 to 13) for linkedMDB, dblp, geonames and dbpedia. Series: gz, hdt, hdt.gz, hdt.lzma.]

Exchange & Decompression Time
[Chart: exchange and decompression time in seconds (geometric mean of all datasets, axis 0 to 300) for HDT+LZMA, HDT+GZIP, LZMA and GZIP. Series: Exchange, Decompress.]
* Assuming a network bandwidth of 2 MByte/s.

Overall Client Time
[Chart: overall client time in seconds (geometric mean of all datasets, axis 0 to 3600) for HDT+GZIP+FOQ, HDT+LZMA+FOQ, GZ+RDF3x, LZMA+RDF3x, GZ+Virtuoso and LZMA+Virtuoso. Series: Exchange, Decompress, Index.]
Dataset     LZMA+RDF3x   HDT+LZMA
linkedMDB   2,1 min      9,21 sec
dblp        27 min       2,02 min
geonames    49,2 min     3,04 min
dbpedia     159 min      14,3 min
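The per-dataset speedup implied by the table can be worked out directly. A short sketch with the table values converted to seconds (the variable names are illustrative only):

```python
MIN = 60.0  # seconds per minute

# (LZMA+RDF3x, HDT+LZMA) overall client times from the table, in seconds.
CLIENT_TIMES = {
    "linkedMDB": (2.1 * MIN, 9.21),
    "dblp": (27 * MIN, 2.02 * MIN),
    "geonames": (49.2 * MIN, 3.04 * MIN),
    "dbpedia": (159 * MIN, 14.3 * MIN),
}

speedups = {name: rdf3x / hdt for name, (rdf3x, hdt) in CLIENT_TIMES.items()}
for name, factor in speedups.items():
    print(f"{name}: HDT+LZMA is ready {factor:.1f}x sooner than LZMA+RDF3x")
```

The factors fall roughly between 11x and 16x across the four datasets.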

In-memory Data Store
Less size = more data in memory = less I/O access!
Index Size (MB):
Dataset     Triples   Virtuoso   Hexastore   RDF3x   HDT-FoQ
LinkedMDB   6,1M      518        6976        337     68
DBLP        46M       3982       -           3252    850
Geonames    112M      9216       -           6678    1435
DBPedia     258M      -          -           15802   5260

Query Performance, Triple Patterns
[Charts: LinkedMDB and Geonames. X-axis: triple patterns SP?, S?O, S??, ?PO, ?P?, ??O. Y-axis: times HDT faster (0 to 16). Series: RDF-3x, Virtuoso.]

Query Performance, Two-way Joins
[Charts: Geonames and LinkedMDB. X-axis: join types SSbig, SSsmall, SObig, SOsmall, OObig, OOsmall. Y-axis: times HDT faster (0 to 3). Series: RDF-3x, Virtuoso.]

Conclusions
• Data is ready to be consumed 10-15x faster.
  – Exchange time reduced.
  – Indexing burden on server = lightweight client processing.
• Competitive query performance.
  – Very fast on triple patterns.
  – Joins on the same scale as existing solutions.
• This is useful to you:
  – If you need a fast, compact, read-only in-memory RDF store.
  – If you want to share self-queryable RDF dumps.
  – If you need fast download & query.
• Addresses the volume issue of Big Data.

Future work
• Full SPARQL support.
  – UNION, OPTIONAL, multiple joins.
  – Optimized query evaluation.
• Integration: Jena, Any23…
• Diffusion. Get more people to use it!
• Additional services on top of HDT.
  – SPARQL Endpoint.
  – Distributed Stream Processing.
  – Mobile Applications.
Thanks! http://www.rdf-hdt.org