weso caepia-20111108
Query Expansion Methods and Performance Evaluation
for Reusing Linking Open Data of the
European Public Procurement Notices
Code: TSI-020100-2010-919
José María Álvarez Rodríguez
WESO-Universidad de Oviedo
http://purl.org/weso/moldeas/
Tecnologías de Linked Data y sus aplicaciones en España (TLDE)
CAEPIA 2011-Tenerife (Spain)
8th of November, 2011
Overview
Use case & Context
SPARQL & Performance
Next Steps
Objective
Creation of a pan-European e-procurement platform
Covering almost all public procurement notices of the European regions
E-procurement Long Tail
TED
BOE (official bulletin of the Spanish Government)
BOPA (official bulletin of the Asturian Government)
To be able to answer …
Which public procurement notices are relevant to Dutch companies (only SMEs) that want to tender for contracts announced by local authorities with a total value lower than 170K € to procure “Road bridge construction work” and a two-year duration in the Dutch-speaking region of Flanders (Belgium)?
[Architecture diagram. Labels: XML sources (TED, BOE, BOPA, …); RDFizing; CPV, NUTS, Eurovoc, Organizations; Linked Data API (Pubby + Snorql); Services (e.g. searching, matchmaking and prediction).]
1 Structuring public procurement notices
2 Transforming government classifications
3 LOD enrichment
4 Providing new semantic-based services
5 Easing the access to the published data using the LOD approach
Semantic
Methods
1,2,3 Preliminary Results
Information                                      Triples
Common Procurement Vocabulary (2003 and 2008)    ~300,00
Organizations                                    ~5,000,000
NUTS                                             36,219
Public procurement notices (2008-2011)           677,058; 2,398,601; 2,590,880; 402,264
Total                                            ~11 million RDF triples
4 Semantic-based Services
Problem of «Query Expansion» depending on the kind of information variable
4 Methods of «Query Expansion»
Expansion
Individual
Taxonomy-based
Directly
Syntactic Search
Spreading Activation
Recommending engine
Location
Georeasoning
User-based
Numeric
Fuzzy Logic
History-based
Correlation
Group
Recommending engine
Remembering …
Which public procurement notices are relevant to Dutch companies (only SMEs) that want to tender for contracts announced by local authorities with a total value lower than 170K € to procure “Road bridge construction work” and a two-year duration in the Dutch-speaking region of Flanders (Belgium)?
Query …
[Query graph around ?ppn. Edge and node labels: ppn:nutsCode, NUTS-B3 (300 RÉG. WALLONNE); cpv:CodeIn2008, cpv:45221111-3; org:classification, SME; ppn:hasAmount, 170,000 €; ppn:hasDuration, 2 years; NL.]
Applying Query Expansion …
[Expanded query graph around ?ppn. Edge and node labels: ppn:nutsCode, NUTS-B3, NUTS-NL326, NUTS-1025, NUTS-BE2; cpv:CodeIn2008, cpv:45221111-3, cpv:45221110-6, cpv:45221113-7, cpv:45221114-4; org:classification, SME; ppn:hasAmount, 170,000-200,000 €; ppn:hasDuration, 2-3 years; NL.]
4 Example of SPARQL query
SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn ppn:nutsCode ?nutsCode .
  ?ppn cpv:codeIn2008 ?cpvCode .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
  FILTER (?cpvCode = cpv:45221111-3 ... ) .
  FILTER ((xsd:double(?amount) >= xsd:long(170000)) && (xsd:double(?amount) <= xsd:long(200000))) .
  FILTER (?nutsCode = nuts:B3 ... ) .
  FILTER ((xsd:long(?duration) >= xsd:long(2)) && (xsd:long(?duration) <= xsd:long(3))) .
}
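The expanded FILTER clauses of a query like this one can be generated from the expanded code lists and ranges instead of being written by hand. The following is a minimal Python sketch, not the MOLDEAS implementation: the helper names (any_of, in_range, expanded_query) are hypothetical, the mapping from the diagram labels (NUTS-B3, ...) to nuts: prefixed names is an assumption, PREFIX declarations are left out because the slides do not give them, and the dc:identifier and dc:date patterns are omitted for brevity.

```python
from typing import Iterable


def any_of(var: str, terms: Iterable[str]) -> str:
    """FILTER accepting any of the given terms (SPARQL 1.0 disjunction)."""
    return "FILTER (" + " || ".join(f"?{var} = {t}" for t in terms) + ") ."


def in_range(var: str, low: int, high: int, cast: str = "xsd:double") -> str:
    """FILTER keeping values inside the closed interval [low, high]."""
    return f"FILTER ({cast}(?{var}) >= {low} && {cast}(?{var}) <= {high}) ."


def expanded_query(cpv_codes, nuts_codes, amount, duration) -> str:
    """Assemble the query body; PREFIX declarations are left out on purpose."""
    lines = [
        "SELECT DISTINCT * WHERE {",
        "  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .",
        "  ?ppn cpv:codeIn2008 ?cpvCode .",
        "  ?ppn ppn:nutsCode ?nutsCode .",
        "  ?ppn ppn:hasAmount ?amount .",
        "  ?ppn ppn:hasDuration ?duration .",
        "  " + any_of("cpvCode", cpv_codes),
        "  " + any_of("nutsCode", nuts_codes),
        "  " + in_range("amount", *amount),
        "  " + in_range("duration", *duration, cast="xsd:long"),
        "}",
    ]
    return "\n".join(lines)


# Values taken from the "Applying Query Expansion" slide.
query = expanded_query(
    cpv_codes=["cpv:45221111-3", "cpv:45221110-6", "cpv:45221113-7", "cpv:45221114-4"],
    nuts_codes=["nuts:B3", "nuts:NL326", "nuts:1025", "nuts:BE2"],  # assumed prefix form
    amount=(170000, 200000),
    duration=(2, 3),
)
print(query)
```

The disjunction with || keeps the sketch within SPARQL 1.0; how the expanded codes and ranges are obtained (taxonomy, georeasoning, fuzzy logic) is outside its scope.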
Context
Performance of SPARQL Queries
~30 sec.
Hardware & Software
DELL PC, 2 GB RAM and 30 GB hard disk
VirtualBox (version 4.0.6)
Linux 2.6.35-22-server #33-Ubuntu 2 SMP x86_64 GNU/Linux
Ubuntu 10.10
OpenLink Virtuoso Opensource-6-20110218
Question?
How can query execution time be decreased without modifying the hardware or using any vendor-specific feature?
Triple store
25 graphs
20 M of RDF Triples
But…
8 graphs
11 M of RDF Triples
Focus on..
The generation of SPARQL queries
Let’s start …
9 SPARQL Queries
3 executions
Ti Simple Enhanced LIMIT FILTER GRAPHS Split Parallel Total queries
T1 * 1
T2 * * 1
T3 * 1
T4 * * 1
T5 * * * 1
T6 * * * * * 4
T6-1 * * * * * * 4
T7 * * * * 5
T7-1 * * * * * 5
T8 * * * * * 20
T8-1 * * * * * * 20
T9 * * * * * 15
T9-1 * * * * * 15
T10 * * * * * 60
T10-1 * * * * * * 60
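The "Total queries" column follows from how an enhanced query is split: one simple query per named graph, per CPV code, per NUTS code, or per combination of these. A minimal Python sketch of that counting, assuming a hypothetical split helper and placeholder graph names (g1 to g4), since the slides do not name the graphs:

```python
from itertools import product

# Codes as listed on the enhanced-query slide (which repeats cpv:48611000).
CPV = ["cpv:15331137", "cpv:48611000", "cpv:48611000", "cpv:50531510", "cpv:15871210"]
NUTS = ["nuts:B3", "nuts:PL", "nuts:RO"]
GRAPHS = ["g1", "g2", "g3", "g4"]  # hypothetical graph names


def split(cpv_codes, nuts_codes, graphs=(None,), per_cpv=False, per_nuts=False):
    """Return one (graph, cpv_subset, nuts_subset) job per simple query."""
    cpv_groups = [[c] for c in cpv_codes] if per_cpv else [list(cpv_codes)]
    nuts_groups = [[n] for n in nuts_codes] if per_nuts else [list(nuts_codes)]
    return list(product(graphs, cpv_groups, nuts_groups))


plans = {
    "T6": split(CPV, NUTS, GRAPHS),                               # per graph
    "T7": split(CPV, NUTS, per_cpv=True),                         # per CPV code
    "T8": split(CPV, NUTS, GRAPHS, per_cpv=True),                 # per graph and CPV code
    "T9": split(CPV, NUTS, per_cpv=True, per_nuts=True),          # per CPV and NUTS code
    "T10": split(CPV, NUTS, GRAPHS, per_cpv=True, per_nuts=True), # per graph, CPV and NUTS code
}
for name, jobs in plans.items():
    print(name, len(jobs))
```

Each job can then be turned into one simple query, restricted to its graph with FROM when a graph is given.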
Simple SPARQL query
SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn ppn:nutsCode ?nutsCode .
  ?ppn cpv:codeIn2008 ?cpvCode .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
  FILTER (?cpvCode = cpv:15331137) .
  FILTER (?nutsCode = nuts:UK) .
}
Simple Query
1 CPV Code, 1 NUTS Code
1 Results T1
Time: ~3,29 sec.
T2
Rewrite SPARQL queries:
Match triples from specific to general
Filter as soon as possible
Use the LIMIT clause
Value set to 10,000
Rewrite SPARQL query
SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn cpv:codeIn2008 ?cpvCode .
  FILTER (?cpvCode = cpv:15331137) .
  ?ppn ppn:nutsCode ?nutsCode .
  FILTER (?nutsCode = nuts:UK) .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
} LIMIT 10000
2 Results T2
1 CPV Code, 1 NUTS Code
Time: ~3,26 sec.
Evaluation
There are no significant changes in execution time or gain… and
we are interested in “enhanced queries”
T3
Execution of enhanced queries
Enhanced SPARQL query
SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn ppn:nutsCode ?nutsCode .
  ?ppn cpv:codeIn2008 ?cpvCode .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
  FILTER (?cpvCode = cpv:15331137 || ?cpvCode = cpv:48611000 || ?cpvCode = cpv:48611000 || ?cpvCode = cpv:50531510 || ?cpvCode = cpv:15871210) .
  FILTER (?nutsCode = nuts:B3 || ?nutsCode = nuts:PL || ?nutsCode = nuts:RO) .
}
5 CPV Codes, 3 NUTS Codes
1 query
3 Results T3
Time: ~20,65 sec.
T4
Rewrite SPARQL queries+
Use the LIMIT clause
5 CPV Codes, 3 NUTS Codes
1 query
4 Results T4 wrt T3
Time: ~20,55 sec.
Info
8 graphs
11 M of RDF Triples
T5
Rewrite SPARQL queries+
Use the LIMIT clause+
Named Graphs (FROM)
5 CPV Codes, 3 NUTS Codes
1 query
5 Results T5 wrt T3
Time: ~20,65 sec.
T6
Rewrite SPARQL queries+
Use the LIMIT clause+
Named Graphs (FROM)+
Split into simple queries
5 CPV Codes, 3 NUTS Codes
4 Graphs, 4 simple queries
6 Results T6 wrt T3
Time: ~20,60 sec.
T6-1
Rewrite SPARQL queries+
Use the LIMIT clause+
Named Graphs (FROM)+
Split enhanced query into simple queries+
Parallelization of query execution (ad-hoc map/reduce)
5 CPV Codes, 3 NUTS Codes
4 Graphs, 4 simple queries
6-1 Results T6-1 wrt T3
Time: ~11,93 sec.
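The parallel step can be pictured as a map over the simple queries followed by a reduce that merges their result sets. The slides do not show the ad-hoc map/reduce itself, so the following Python sketch is only an assumption about its shape: run_parallel and the run_query callable are hypothetical names, a thread pool stands in for whatever mechanism was used, and result rows are assumed to be hashable tuples.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(queries, run_query, max_workers=4):
    """Map: run every simple query concurrently.
    Reduce: concatenate the partial results, dropping duplicate rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the order of `queries` in its results.
        partial_results = list(pool.map(run_query, queries))

    merged, seen = [], set()
    for rows in partial_results:
        for row in rows:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged


def fake_endpoint(query):
    """Stand-in for a call to the SPARQL endpoint (hypothetical)."""
    return [("ppn-shared",), (f"ppn-for-{query}",)]


rows = run_parallel(["q1", "q2", "q3", "q4"], fake_endpoint)
print(rows)
```

Removing duplicates in the reduce step plays the role of DISTINCT across the split queries; in a real setting run_query would send the query text to the triple store and parse the response.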
T7
Rewrite SPARQL queries+
Use the LIMIT clause+
Split enhanced query into simple queries
1 CPV Code (5), 3 NUTS Codes
5 simple queries
7 Results T7 wrt T3
Time: ~15,81 sec.
T7-1
Rewrite SPARQL queries+
Use the LIMIT clause+
Split enhanced query into simple queries+
Parallelization of query execution (ad-hoc map/reduce)
1 CPV Code (5), 3 NUTS Codes
5 simple queries
7-1 Results T7-1 wrt T3
Time: ~10,55 sec.
T8
Rewrite SPARQL queries+
Use the LIMIT clause+
Named Graphs (FROM)+
Split into simple queries
1 CPV Code (5), 3 NUTS Codes
4 Graphs, 20 simple queries
8 Results T8 wrt T3
Time: ~32,34 sec.
T8-1
Rewrite SPARQL queries+
Use the LIMIT clause+
Named Graphs (FROM)+
Split enhanced query into simple queries+
Parallelization of query execution (ad-hoc map/reduce)
1 CPV Code (5), 3 NUTS Codes
4 Graphs, 20 simple queries
8-1 Results T8-1 wrt T3
Time: ~18,45 sec.
T9
Rewrite SPARQL queries+
Use the LIMIT clause+
Split enhanced query into simple queries (1 CPV code+1 NUTS code)
1 CPV Code (5), 1 NUTS Code (3)
15 simple queries
9 Results T9 wrt T3
Time: ~22,62 sec.
T9-1
Rewrite SPARQL queries+
Use the LIMIT clause+
Split enhanced query into simple queries (1 CPV code+1 NUTS code)+
Parallelization of query execution (ad-hoc map/reduce)
1 CPV Code (5), 1 NUTS Code (3)
15 simple queries
9-1 Results T9-1 wrt T3
Time: ~12,77 sec.
T10
Rewrite SPARQL queries+
Use the LIMIT clause+
Named Graphs (FROM)+
Split into simple queries (1 CPV code+1 NUTS code)
1 CPV Code (5), 1 NUTS Code (3)
4 Graphs, 60 simple queries
10 Results T10 wrt T3
Time: ~71,63 sec.
T10-1
Rewrite SPARQL queries+
Use the LIMIT clause+
Named Graphs (FROM)+
Split enhanced query into simple queries (1 CPV code+1 NUTS code)+
Parallelization of query execution (ad-hoc map/reduce)
1 CPV Code (5), 1 NUTS Code (3)
4 Graphs, 60 simple queries
10-1 Results T10-1 wrt T3
Time: ~35,13 sec.
Table of Results
Ti       Time (sec.)   Gain (%)
T1       3,29          N/A
T2       3,26          0,93
T3       20,65         N/A
T4       20,55         0,49
T5       20,65         0
T6       20,6          0,24
T6-1     11,93         73,09
T7       15,81         30,61
T7-1     10,55         95,73
T8       32,34         -36,15
T8-1     18,45         11,92
T9       22,62         -8,71
T9-1     12,77         61,71
T10      71,63         -71,17
T10-1    35,13         -41,22
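The slides do not define the Gain column, but the reported values are consistent with gain = (T_ref - T_i) / T_i × 100, where T_ref is the baseline time (T3 for the enhanced-query tests, T1 for T2). The Python sketch below recomputes the column under that inferred formula; the function name gain is hypothetical.

```python
def gain(reference: float, measured: float) -> float:
    """Relative gain in percent of `measured` with respect to `reference`.

    Formula inferred from the table, not stated on the slides:
    (reference - measured) / measured * 100.
    """
    return round((reference - measured) / measured * 100, 2)


T3 = 20.65  # baseline: enhanced query executed as a single query
times = {
    "T4": 20.55, "T5": 20.65, "T6": 20.6, "T6-1": 11.93,
    "T7": 15.81, "T7-1": 10.55, "T8": 32.34, "T8-1": 18.45,
    "T9": 22.62, "T9-1": 12.77, "T10": 71.63, "T10-1": 35.13,
}
gains = {name: gain(T3, seconds) for name, seconds in times.items()}
for name, value in gains.items():
    print(f"{name:6s} {value:8.2f}")
```

With the rounded times from the table, the same formula gives about 0.92 for T2 against T1 rather than the reported 0,93, a difference that rounding of the published timings could explain.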
Discussion
• The number of queries is a key factor
• The number of CPV codes implies more execution time
• The parallelization improves execution time
• T7-1 is the best execution in terms of time
  • Rewrite SPARQL queries
  • Use the LIMIT clause
  • Split enhanced query into simple queries
  • Parallelization of query execution
Further Steps
• Distribute graphs in different nodes (HW improvement)
• Use of other triple stores (SW comparison)
• Add SPARQL 1.1 new features (Expressiveness improvement)
• Cache of queries (SW improvement)
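As an illustration of the query-cache step, a repeated simple query could be answered from memory instead of reaching the triple store again. This is a sketch of one possible approach rather than a planned design: the cached wrapper and the run_query callable are hypothetical, and cache invalidation after data updates is not handled.

```python
from functools import lru_cache


def cached(run_query, maxsize=128):
    """Wrap a query runner so identical query strings hit the store only once."""

    @lru_cache(maxsize=maxsize)
    def runner(query: str):
        # Results are frozen into a tuple so cached values cannot be mutated.
        return tuple(run_query(query))

    return runner


calls = []


def fake_endpoint(query):
    """Stand-in for the SPARQL endpoint (hypothetical); records each real call."""
    calls.append(query)
    return [("ppn-1",), ("ppn-2",)]


run = cached(fake_endpoint)
first = run("SELECT ...")
second = run("SELECT ...")  # served from the cache
print(len(calls), first == second)
```

Splitting an enhanced query into simple queries makes such a cache more useful, because different enhanced queries may share simple sub-queries.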