GDG Meets U event - Big data & Wikidata - no lies codelab
DESCRIPTION
RDF triple information from Wikipedia. Making SPARQL queries on the DBpedia endpoint.
TRANSCRIPT
BigData & Wikidata - no lies
SPARQL queries on DBPedia
Camelia Boban
Resources for the codelab:
Eclipse Luna for J2EE developers - https://www.eclipse.org/downloads/index-developer.php
Java SE 1.8 - http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html
Apache Tomcat 8.0.5 - http://tomcat.apache.org/download-80.cgi
Axis2 1.6.2 - http://axis.apache.org/axis2/java/core/download.cgi
Apache Jena 2.11.1 - http://jena.apache.org/download/
DBpedia SPARQL endpoint - http://dbpedia.org/sparql
JARs needed:
httpclient-4.2.3.jar httpcore-4.2.2.jar jena-arq-2.11.1.jar
jena-core-2.11.1.jar jena-iri-1.0.1.jar jena-sdb-1.4.1.jar jena-tdb-1.0.1.jar
slf4j-api-1.6.4.jar slf4j-log4j12-1.6.4.jar
xercesImpl-2.11.0.jar xml-apis-1.4.01.jar
Attention!
Do NOT include jcl-over-slf4j-1.6.4.jar: it conflicts with slf4j-log4j12-1.6.4.jar (“Can’t override final class” exception).
Do NOT include httpcore-4.0.jar (brought in by Axis): it conflicts with httpcore-4.2.2.jar and prevents the web service from being created.
The Semantic Web
The Semantic Web is a project that intends to add computer-processable
meaning (semantics) to the World Wide Web.
SPARQL
A protocol and an SQL-like query language for querying RDF graphs via pattern matching.
VIRTUOSO
Both the back-end database engine and the HTTP/SPARQL server.
DBpedia.org
The Semantic Web mirror of Wikipedia.
RDF
A graph data model built from subject, predicate, object triples.
APACHE JENA
A free and open source Java framework for building Semantic Web and Linked Data applications.
ARQ - a SPARQL processor for Jena that can also query remote SPARQL services.
DBpedia.org extracts data from Wikipedia editions in 119 languages, converts
it into RDF and makes this information available on the Web:
★ 24.9 million things (16.8 million from the English DBpedia);
★ labels and abstracts for 12.6 million unique things;
★ 24.6 million links to images and 27.6 million links to external
web pages;
★ 45.0 million external links into other RDF datasets, 67.0 million
links to Wikipedia categories, and 41.2 million YAGO categories.
The dataset consists of 2.46 billion RDF triples, of which 470 million were
extracted from the English edition of Wikipedia, 1.98 billion from other
language editions, and 45 million are links to external datasets.
DBpedia uses the Resource Description Framework (RDF) as a
flexible data model for representing extracted information and for
publishing it on the Web. We use the SPARQL query language to query
this data.
What is a Triple?
A Triple is the minimal unit of information expressible in the Semantic Web. It is composed of 3 elements:
1. A subject which is a URI (e.g., a "web address") that represents
something.
2. A predicate which is another URI that represents a certain
property of the subject.
3. An object which can be a URI or a literal (a string) that is related
to the subject through the predicate.
John has the email address
(subject) (predicate) (object)
Subjects, predicates, and objects are represented with URIs, which can
be abbreviated as prefixed names.
Objects can also be literals: strings, integers, booleans, etc.
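The triple model can be sketched in a few lines of Python. This is only an illustration: the `http://example.org/` resources are hypothetical, while `foaf:name` and `foaf:knows` are properties of the FOAF vocabulary. A pattern with `None` in a slot plays the role of a SPARQL variable.

```python
from typing import Optional

# A triple is (subject, predicate, object). The example.org resources are hypothetical.
Triple = tuple[str, str, str]

EX = "http://example.org/"
FOAF = "http://xmlns.com/foaf/0.1/"

triples: list[Triple] = [
    (EX + "john", FOAF + "name", "John"),        # object is a literal
    (EX + "john", FOAF + "knows", EX + "mary"),  # object is a URI
    (EX + "mary", FOAF + "name", "Mary"),
]

def match(pattern: tuple[Optional[str], Optional[str], Optional[str]],
          data: list[Triple]) -> list[Triple]:
    """Return the triples that agree with every non-None slot of the pattern."""
    return [t for t in data
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "Who has a name?" : subject and object are left open, like ?s and ?o in SPARQL.
names = match((None, FOAF + "name", None), triples)
for subject, _, value in names:
    print(subject, value)
```

A real triple store indexes the data instead of scanning a list, but the matching idea is the same.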
Why SPARQL?
SPARQL is a query language of the Semantic Web that lets us:
1. Extract values from structured and semi-structured data
2. Explore data by querying unknown relationships
3. Perform complex joins across various datasets in a single query
4. Transform data from one vocabulary into another
Structure of a SPARQL query:
● Prefix declarations, for abbreviating URIs (with PREFIX dbpowl: <http://dbpedia.org/ontology/>, dbpowl:Mountain = <http://dbpedia.org/ontology/Mountain>)
● Dataset definition, stating what RDF graph(s) are being queried (DBPedia, Darwin Core Terms, Yago, FOAF - Friend of a Friend)
● A result clause, identifying what information to return from the query (SELECT)
● The query pattern, specifying what to query for in the underlying dataset (WHERE)
● Query modifiers, slicing, ordering, and otherwise rearranging query results - ORDER BY, GROUP BY
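To make the structure concrete, here is a minimal Python sketch that assembles a SELECT query from those parts. The helper `build_select_query` is hypothetical (it is not part of Jena or any SPARQL library) and only concatenates strings; the dataset definition is left out because FROM is optional.

```python
def build_select_query(prefixes, variables, patterns, order_by=None):
    """Assemble a SELECT query from its structural parts (hypothetical helper)."""
    # 1. Prefix declarations
    lines = [f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items()]
    # 2. Result clause
    lines.append("SELECT " + " ".join(f"?{v}" for v in variables))
    # 3. Query pattern
    lines.append("WHERE {")
    lines.extend(f"  {p} ." for p in patterns)
    lines.append("}")
    # 4. Query modifiers
    if order_by:
        lines.append("ORDER BY " + " ".join(f"?{v}" for v in order_by))
    return "\n".join(lines)

query = build_select_query(
    prefixes={"dbpowl": "http://dbpedia.org/ontology/"},
    variables=["mountain"],
    patterns=["?mountain a dbpowl:Mountain"],
    order_by=["mountain"],
)
print(query)
```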
##EXAMPLE - Give me all cities & towns in Abruzzo with more than 50,000 inhabitants
PREFIX dbpclass: <http://dbpedia.org/class/yago/>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?resource ?value
WHERE {
  ?resource a dbpclass:CitiesAndTownsInAbruzzo .
  ?resource dbpprop:populationTotal ?value .
  FILTER ( ?value > 50000 )
}
ORDER BY ?resource ?value
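Outside the codelab's Java/Jena setup, the same query can be sent to the endpoint as a plain HTTP GET, because the SPARQL protocol carries the query text in a `query` parameter. The sketch below only builds the request URL with the Python standard library and does not contact the server; the function name `sparql_get_url` is an assumption made for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX dbpclass: <http://dbpedia.org/class/yago/>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?resource ?value
WHERE {
  ?resource a dbpclass:CitiesAndTownsInAbruzzo .
  ?resource dbpprop:populationTotal ?value .
  FILTER ( ?value > 50000 )
}
ORDER BY ?resource ?value
"""

def sparql_get_url(endpoint: str, query: str) -> str:
    """Percent-encode the query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

Fetching that URL with an `Accept: application/sparql-results+json` header should return the bindings as JSON, provided the endpoint is reachable and the class and property used in the query still exist in the current dataset.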
Some PREFIX:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX txn: <http://lod.taxonconcept.org/ontology/txn.owl#>
DBPEDIA
----------------------------------------------------------------------------------
PREFIX dbp: <http://dbpedia.org/>
PREFIX dbpowl: <http://dbpedia.org/ontology/>
PREFIX dbpres: <http://dbpedia.org/resource/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpclass: <http://dbpedia.org/class/yago/>
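A prefixed name is just shorthand: the prefix is replaced by its namespace URI and the local part is appended. A small Python sketch of that expansion, using the DBpedia prefixes listed above (the `expand` helper is hypothetical):

```python
PREFIXES = {
    "dbpowl": "http://dbpedia.org/ontology/",
    "dbpres": "http://dbpedia.org/resource/",
    "dbpprop": "http://dbpedia.org/property/",
    "dbpclass": "http://dbpedia.org/class/yago/",
}

def expand(prefixed_name: str, prefixes: dict[str, str] = PREFIXES) -> str:
    """Turn 'prefix:local' into the full URI it abbreviates."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix}")
    return prefixes[prefix] + local

print(expand("dbpres:Pulp_Fiction"))
print(expand("dbpowl:Mountain"))
```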
Wikipedia articles consist mostly of free text, but also contain different
types of structured information: infobox templates,
categorisation information, images, geo-coordinates, and links to
external Web pages. DBpedia transforms the data entered in Wikipedia into RDF
triples, so creating a page in Wikipedia creates RDF in DBpedia.
Example:
https://en.wikipedia.org/wiki/Pulp_Fiction describes the movie. DBpedia
creates a URI: http://dbpedia.org/resource/wikipedia_page_name (where
wikipedia_page_name is the name of the regular Wikipedia HTML page)
= http://dbpedia.org/resource/Pulp_Fiction, shown as an HTML page at
http://dbpedia.org/page/Pulp_Fiction. Underscore characters replace spaces.
DBpedia can be queried via a Web interface at http://dbpedia.org/sparql .
The interface uses the Virtuoso SPARQL Query Editor to query the
DBpedia endpoint.
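The naming rule for resources can be sketched as a small Python function. It is a simplification: it replaces spaces with underscores and percent-encodes the remaining unsafe characters, while DBpedia's own encoding rules may differ in edge cases. The function name is hypothetical.

```python
from urllib.parse import quote

RESOURCE_BASE = "http://dbpedia.org/resource/"

def dbpedia_resource_uri(wikipedia_title: str) -> str:
    """Map a Wikipedia page title to the matching DBpedia resource URI (simplified)."""
    name = wikipedia_title.strip().replace(" ", "_")
    # Keep common title punctuation readable, percent-encode everything else unsafe.
    return RESOURCE_BASE + quote(name, safe="_(),'")

print(dbpedia_resource_uri("Pulp Fiction"))
```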
Public SPARQL endpoint - uses OpenLink Virtuoso
Wikipedia page: http://en.wikipedia.org/wiki/Pulp_Fiction
DBpedia resource: http://dbpedia.org/page/Pulp_Fiction
InfoBox: dbpedia-owl:abstract; dbpedia-owl:starring; dbpedia-owl:budget;
dbpprop:country; dbpprop:caption, etc.
For instance, the figure below shows the source code and the
visualisation of an infobox template containing structured information
about Pulp Fiction.
PREFIX prop: <http://dbpedia.org/property/>
PREFIX res: <http://dbpedia.org/resource/>
# Note: "owl:" here abbreviates the DBpedia ontology, not the W3C OWL namespace
PREFIX owl: <http://dbpedia.org/ontology/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?name ?abstract ?caption ?image ?budget ?director ?cast ?country ?category
WHERE {
  res:Pulp_Fiction prop:name ?name ;
    owl:abstract ?abstract ;
    prop:caption ?caption ;
    owl:thumbnail ?image ;
    owl:budget ?budget ;
    owl:director ?director ;
    owl:starring ?cast ;
    prop:country ?country ;
    dcterms:subject ?category .
  FILTER langMatches( lang(?abstract), 'en' )
}
...
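The results of a SELECT query come back as a table of variable bindings. When the endpoint is asked for JSON, the W3C "SPARQL 1.1 Query Results JSON Format" is used. The sketch below parses a small hand-written sample in that format; the sample is illustrative and is not real output from the DBpedia endpoint, and the `rows` helper is hypothetical.

```python
import json

# Hand-written sample in the SPARQL 1.1 Query Results JSON Format.
# Illustrative only: this is NOT captured output from dbpedia.org.
SAMPLE = """
{
  "head": {"vars": ["name", "director"]},
  "results": {"bindings": [
    {"name": {"type": "literal", "xml:lang": "en", "value": "Pulp Fiction"},
     "director": {"type": "uri", "value": "http://dbpedia.org/resource/Quentin_Tarantino"}}
  ]}
}
"""

def rows(result_json: str) -> list[dict[str, str]]:
    """Flatten each binding to {variable: value}, omitting unbound variables."""
    doc = json.loads(result_json)
    variables = doc["head"]["vars"]
    return [{v: b[v]["value"] for v in variables if v in b}
            for b in doc["results"]["bindings"]]

for row in rows(SAMPLE):
    print(row["name"], row["director"])
```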
Linked Data is a method of publishing RDF data on the Web and of interlinking data between different data sources.
Query builders:
➢ http://dbpedia.org/snorql/
➢ http://querybuilder.dbpedia.org/
➢ http://dbpedia.org/isparql/
➢ http://dbpedia.org/fct/
➢ http://it.dbpedia.org/sparql
Variables are prefixed with "?"
The current RDF vocabularies are available at the following locations:
➔ W3: http://www.w3.org/TR/vcard-rdf/ vCard Ontology - for describing People and Organizations
http://www.w3.org/2003/01/geo/ Geo Ontology - for spatially-located things
http://www.w3.org/2004/02/skos/ SKOS - Simple Knowledge Organization System
➔ GEO NAMES: http://www.geonames.org/ geospatial semantic information (postal code)
➔ DUBLIN CORE: http://www.dublincore.org/ defines general metadata attributes used in a particular application
➔ FOAF: http://www.foaf-project.org/ Friend of a Friend, vocabulary for describing people
➔ UNIPROT: http://www.uniprot.org/core/, http://beta.sparql.uniprot.org/uniprot for science articles
➔ MUSIC ONTOLOGY: http://musicontology.com/, provides terms for describing artists, albums and tracks.
➔ REVIEW VOCABULARY: http://purl.org/stuff/rev , vocabulary for representing reviews.
➔ CREATIVE COMMONS (CC): http://creativecommons.org/ns , vocabulary for describing license terms.
➔ OPEN UNIVERSITY: http://data.open.ac.uk/
➔ Semantically-Interlinked Online Communities (SIOC): www.sioc-project.org/, vocabulary for representing online communities
➔ Description of a Project (DOAP): http://usefulinc.com/doap/, vocabulary for describing projects
➔ Simple Knowledge Organization System (SKOS): http://www.w3.org/2004/02/skos/, vocabulary for representing taxonomies and loosely structured knowledge
SPARQL queries have two parts (FROM is optional):
1. The query (WHERE) part, which produces a list of variable bindings
(although some variables may be unbound).
2. The part which puts together the results. SELECT, ASK, CONSTRUCT,
or DESCRIBE.
Other keywords:
UNION, OPTIONAL (optional display if data exists), FILTER
(conditions), ORDER BY, GROUP BY
SELECT - returns the variable bindings themselves, as a table of results (a ResultSet)
ASK - just checks whether there are any results (returns true or false)
CONSTRUCT - uses a template to make RDF from the results. For each
result row it binds the variables and adds the statements to the result
model. If a template triple contains an unbound variable it is skipped.
Returns a new RDF graph.
DESCRIBE - unusual, since it takes each result node, finds triples
associated with it, and adds them to a result model. Returns a new RDF graph.
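The CONSTRUCT behaviour described above (bind the variables of a template for each result row, skip template triples that still contain an unbound variable) can be imitated in plain Python. This is a toy model of the semantics, not how Jena implements it, and every `ex:` name is a hypothetical placeholder.

```python
def construct(template, solutions):
    """Instantiate each template triple for each solution row.

    Template triples whose variables are not all bound in a row are skipped.
    """
    graph = set()
    for binding in solutions:
        for triple in template:
            terms = []
            for term in triple:
                if term.startswith("?"):
                    term = binding.get(term[1:])  # None when the variable is unbound
                terms.append(term)
            if None not in terms:
                graph.add(tuple(terms))
    return graph

template = [("?film", "ex:directedBy", "?director"),
            ("?film", "ex:budget", "?budget")]
solutions = [
    {"film": "ex:FilmA", "director": "ex:PersonA", "budget": "1000"},
    {"film": "ex:FilmB", "director": "ex:PersonB"},  # ?budget is unbound in this row
]
graph = construct(template, solutions)
for triple in sorted(graph):
    print(triple)
```

The second row contributes only its `ex:directedBy` triple, because the budget triple of the template cannot be completed.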
What is linked data good for? Don’t search for a single thing, but explore a whole set of related things together!
1) Revolutionize Wikipedia Search
2) Include DBpedia data in our own web page
3) Mobile and Geographic Applications
4) Document Classification, Annotation and Social Bookmarking
5) Multi-Domain Ontology
6) Nucleus for the Web of Data
7) Support Wikipedia Authors with Editing Suggestions
WIKIPEDIA DUMPS
● Arabic Wikipedia dumps: http://dumps.wikimedia.org/arwiki/
● Dutch Wikipedia dumps: http://dumps.wikimedia.org/nlwiki/
● English Wikipedia dumps: http://dumps.wikimedia.org/enwiki/
● French Wikipedia dumps: http://dumps.wikimedia.org/frwiki/
● German Wikipedia dumps: http://dumps.wikimedia.org/dewiki/
● Italian Wikipedia dumps: http://dumps.wikimedia.org/itwiki/
● Persian Wikipedia dumps: http://dumps.wikimedia.org/fawiki/
● Polish Wikipedia dumps: http://dumps.wikimedia.org/plwiki/
● Portuguese Wikipedia dumps: http://dumps.wikimedia.org/ptwiki/
● Russian Wikipedia dumps: http://dumps.wikimedia.org/ruwiki/
● Serbian Wikipedia dumps: http://dumps.wikimedia.org/srwiki/
● Spanish Wikipedia dumps: http://dumps.wikimedia.org/eswiki/
● Swedish Wikipedia dumps: http://dumps.wikimedia.org/svwiki/
● Ukrainian Wikipedia dumps: http://dumps.wikimedia.org/ukwiki/
● Vietnamese Wikipedia dumps: http://dumps.wikimedia.org/viwiki/
LINKS
Codelab’s project code: http://github.com/GDG-L-Ab/SparqlOpendataWS
http://dbpedia.org/sparql & http://it.dbpedia.org/sparql
http://wiki.dbpedia.org/Datasets
http://en.wikipedia.org/ & http://it.wikipedia.org/
http://dbpedia.org/snorql, http://data.semanticweb.org/snorql/ SPARQL Explorer
http://downloads.dbpedia.org/3.9/ & http://wiki.dbpedia.org/Downloads39
Projects that use linked data:
JAVA: Open Learn Linked data - free access to Open University course materials
PHP: Semantic MediaWiki - lets you store and query data within the wiki's pages
PERL: WikSAR
PYTHON: Braindump - semantic search in Wikipedia
RUBY: SemperWiki - lets you store and query data within the wiki's pages
THANK YOU! :-)
I AM
CAMELIA BOBAN
G+ : https://plus.google.com/u/0/+cameliaboban
Twitter : http://twitter.com/GDGRomaLAb
LinkedIn: it.linkedin.com/pub/camelia-boban/22/191/313/
Blog: http://blog.aissatechnologies.com/
Skype: [email protected]