dbpedia ♥ commons

2nd DBpedia Meeting Leipzig 03.09.2014 DBpedia Commons Gaurav Vaidya - Dimitris Kontokostas - Andrea Di Menna - Jim O'Regan

Upload: dimitris-kontokostas

Post on 22-Apr-2015

DESCRIPTION

Extract semi-structured data from Wikimedia Commons to RDF using the DBpedia Extraction Framework

TRANSCRIPT

Page 1: DBpedia ♥ Commons

2nd DBpedia Meeting Leipzig 03.09.2014

DBpedia ♥ Commons

Gaurav Vaidya - Dimitris Kontokostas - Andrea Di Menna - Jim O'Regan

Page 2: DBpedia ♥ Commons

~23M pages like this

Page 3: DBpedia ♥ Commons

~23M pages like this

Page 4: DBpedia ♥ Commons

A lot of pages like this

Page 5: DBpedia ♥ Commons

Many pages like this

Page 6: DBpedia ♥ Commons

Not very similar to pages like this

Page 7: DBpedia ♥ Commons

DBpedia Extraction Framework

✔ “Wiki agnostic”

✔ Pluggable extractors

✔ Out of the box support for common metadata

✗ Tuned for extraction in the main namespace (not File:)

✗ Many other challenges left

Page 8: DBpedia ♥ Commons

Challenges

✔ File metadata

✔ KML files

✔ Image Galleries

✔ Image Annotations

✔ Mappings Wiki

✔ Bootstrap community mappings

✔ Template Statistics

✔ Licensing

✔ Technical details I'll not go into

Page 9: DBpedia ♥ Commons

Out-of-the-box support

● Categories (skos)

● External links

● Geo-coordinates

● Raw infobox properties

● Labels

● PageIds / Revisions

● Links (internal / external)

● Mappings Wiki (with some tweaking / more on that later)

Page 10: DBpedia ♥ Commons

File metadata

● New Extractor

● New file Class hierarchy

– dbo:File, dbo:Image, dbo:StillImage, dbo:MovingImage and dbo:Sound

Sample Output:

:Aeropetes.JPG a dbo:StillImage, dbo:Image, dbo:Document, dbo:File, Work ;
    dcterms:type dbo:StillImage ;
    dbo:fileExtension "jpg" ;
    dcterms:format "image/jpeg" ;
    dbo:fileURL commons-path:Aeropetes.JPG ;
    foaf:depiction commons-path:Aeropetes.JPG ;
    dbo:thumbnail commons-path:Aeropetes.JPG?width=300 .
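
As a rough illustration of how triples like the sample output could be derived from a file name, here is a minimal Python sketch. It is not the framework's extractor: `EXTENSION_INFO`, `SUPERCLASSES` and `describe_file` are hypothetical names, and every table row other than the "jpg" one is an assumption. Only the property and class names come from the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table: extension -> (MIME type, most specific class).
# Only the "jpg" row is backed by the sample output; the rest are assumptions.
EXTENSION_INFO: Dict[str, Tuple[str, str]] = {
    "jpg": ("image/jpeg", "dbo:StillImage"),
    "png": ("image/png", "dbo:StillImage"),
    "webm": ("video/webm", "dbo:MovingImage"),
}

# Superclass chain modelled on the sample; the MovingImage chain is assumed.
SUPERCLASSES: Dict[str, List[str]] = {
    "dbo:StillImage": ["dbo:Image", "dbo:Document", "dbo:File"],
    "dbo:MovingImage": ["dbo:Image", "dbo:Document", "dbo:File"],
}


def describe_file(file_name: str) -> List[Tuple[str, str, str]]:
    """Return (subject, predicate, object) triples for a Commons file name."""
    if "." not in file_name:
        return []
    extension = file_name.rsplit(".", 1)[-1].lower()
    if extension not in EXTENSION_INFO:
        return []
    mime, cls = EXTENSION_INFO[extension]
    subject = ":" + file_name
    path = "commons-path:" + file_name
    # One rdf:type triple per class in the chain, most specific first.
    triples = [(subject, "a", c) for c in [cls] + SUPERCLASSES[cls]]
    triples += [
        (subject, "dcterms:type", cls),
        (subject, "dbo:fileExtension", f'"{extension}"'),
        (subject, "dcterms:format", f'"{mime}"'),
        (subject, "dbo:fileURL", path),
        (subject, "foaf:depiction", path),
        (subject, "dbo:thumbnail", path + "?width=300"),
    ]
    return triples
```

Calling `describe_file("Aeropetes.JPG")` yields triples equivalent to the sample, apart from the unprefixed `Work` class, which the sketch leaves out.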

Page 11: DBpedia ♥ Commons

Image Galleries

● Attach each gallery item to the page resource

:Colorado dbo:hasGalleryItem Colorado.JPG, Denver_Colorado_Art.jpg, ColoradoCenter1.jpg.
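
The gallery step can be sketched in Python as follows. This is an illustration only, assuming galleries are written with the standard MediaWiki `<gallery>` tag, one `File:Name|caption` entry per line. The helper names `gallery_items` and `gallery_triple` are hypothetical.

```python
import re
from typing import List

# Matches the body of each <gallery ...> ... </gallery> block.
GALLERY_RE = re.compile(r"<gallery[^>]*>(.*?)</gallery>", re.DOTALL | re.IGNORECASE)


def gallery_items(wikitext: str) -> List[str]:
    """Collect file names listed inside <gallery> blocks, in document order."""
    items: List[str] = []
    for body in GALLERY_RE.findall(wikitext):
        for line in body.splitlines():
            # Everything before the first "|" is the file reference.
            target = line.split("|", 1)[0].strip()
            if ":" in target:
                prefix, name = target.split(":", 1)
                # Drop the File:/Image: namespace prefix when present.
                if prefix.strip().lower() in ("file", "image"):
                    target = name.strip()
            if target:
                items.append(target.replace(" ", "_"))
    return items


def gallery_triple(page: str, wikitext: str) -> str:
    """Render one dbo:hasGalleryItem statement in the slide's abbreviated style."""
    items = gallery_items(wikitext)
    if not items:
        return ""
    return f":{page} dbo:hasGalleryItem " + ", ".join(items) + " ."
```

Spaces in file names are replaced by underscores so that the output resembles the resource names in the example above.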

Page 12: DBpedia ♥ Commons

Image Annotations

● AnnotationGadget

● Boxes with optional description

Page 13: DBpedia ♥ Commons

Image Annotations

● W3C Media Fragments recommendation

● Embed the box in the URI

– ?width=15130&height=1886#xywh=pixel:10431,324,1670,1208> .

● Add descriptions in the new resource
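
To make the URI scheme concrete, the sketch below builds and parses a spatial media fragment (`#xywh=pixel:x,y,w,h`) in Python. The base URI in the usage note is a placeholder, because the slide shows only the query and fragment part. `annotation_uri` and `parse_box` are hypothetical helper names, not framework functions.

```python
from typing import Tuple
from urllib.parse import urlsplit


def annotation_uri(base: str, width: int, height: int,
                   x: int, y: int, w: int, h: int) -> str:
    """Append the image size and a pixel box to a file URI."""
    return f"{base}?width={width}&height={height}#xywh=pixel:{x},{y},{w},{h}"


def parse_box(uri: str) -> Tuple[str, int, int, int, int]:
    """Read the unit and box coordinates back from an xywh fragment."""
    fragment = urlsplit(uri).fragment
    if not fragment.startswith("xywh="):
        raise ValueError("no xywh media fragment in URI")
    value = fragment[len("xywh="):]
    # The unit prefix ("pixel:" or "percent:") is optional; pixel is the default.
    unit, _, coords = value.rpartition(":")
    x, y, w, h = (int(n) for n in coords.split(","))
    return (unit or "pixel", x, y, w, h)
```

For example, `annotation_uri("http://example.org/File.jpg", 15130, 1886, 10431, 324, 1670, 1208)` ends in the same query and fragment shown on the slide, and `parse_box` recovers the box from it.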

Page 14: DBpedia ♥ Commons

Mappings Wiki

Page 15: DBpedia ♥ Commons

Template Statistics

Page 16: DBpedia ♥ Commons

Licensing

● Automatically identified and imported ~360 licence templates

● Use the mappings wiki

● Needed some hacking to make it work

– e.g. {{Self|GFDL|cc-by-sa-3.0,2.5,2.0,1.0}}

:Acraea_circeis.JPG dbo:license <http://creativecommons.org/publicdomain/mark/1.0/>

:Antepipona_deflenda_-_2012-10-17.webm dbo:license <http://creativecommons.org/licenses/by-sa/3.0/>
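
One plausible reading of the "hacking" needed for templates such as `{{Self|GFDL|cc-by-sa-3.0,2.5,2.0,1.0}}` is sketched below in Python. It assumes that the comma-separated suffix lists further versions of the same Creative Commons licence, and it maps only CC identifiers to URIs. `expand_license_args` and `cc_uri` are hypothetical helpers, and the actual extraction goes through the mappings wiki rather than code like this.

```python
import re
from typing import List, Optional

# Creative Commons identifiers such as "cc-by-sa-3.0" (assumed naming pattern).
CC_RE = re.compile(r"^cc-(by(?:-nc)?(?:-nd|-sa)?)-(\d\.\d)$", re.IGNORECASE)


def expand_license_args(template: str) -> List[str]:
    """Split a {{Self|...}} call into one licence identifier per entry."""
    inner = template.strip()
    if inner.startswith("{{") and inner.endswith("}}"):
        inner = inner[2:-2]
    _name, *args = [part.strip() for part in inner.split("|")]
    expanded: List[str] = []
    for arg in args:
        if "=" in arg:
            continue  # skip named parameters such as author=...
        head, *versions = arg.split(",")
        expanded.append(head)
        match = CC_RE.match(head)
        if match:
            # "cc-by-sa-3.0,2.5" -> reuse the "cc-by-sa-" stem for "2.5".
            stem = head[: -len(match.group(2))]
            expanded.extend(stem + v.strip() for v in versions)
        else:
            expanded.extend(v.strip() for v in versions)
    return expanded


def cc_uri(license_id: str) -> Optional[str]:
    """Map a CC identifier to its licence URI; return None for anything else."""
    match = CC_RE.match(license_id)
    if match is None:
        return None
    return f"http://creativecommons.org/licenses/{match.group(1).lower()}/{match.group(2)}/"
```

Non-CC identifiers such as `GFDL` are passed through by `expand_license_args` but deliberately left unmapped by `cc_uri`, since the slide does not show which URI they resolve to.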

Page 17: DBpedia ♥ Commons

KML Annotations attached to media

Attach raw KML data to resource with custom extractor

Sample Output:

:Yellowstone_1871b.jpg dbo:hasKMLData """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2"><GroundOverlay>
<name>Yorktown, Indiana (1878)</name>
<description>An 1878 map of Yorktown in Tippecanoe County, Indiana. Source: Kingman Brothers&apos; Combination Atlas Map of Tippecanoe County, Indiana, 1878.</description>
<color>99ffffff</color>
<Icon><href>BIG_LINK_HERE</href><viewBoundScale>0.75</viewBoundScale></Icon>
<LatLonBox><north>40.26126145890567</north><south>40.25777915632657</south><east>-86.77033439383223</east><west>-86.77398493316619</west><rotation>-1.123009884936565</rotation></LatLonBox>
</GroundOverlay></kml>"""^^rdfs:XMLLiteral .
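
Because the KML is attached as a raw literal, consumers have to parse it themselves. A minimal Python sketch of reading the overlay's bounding box with the standard library might look like this. `lat_lon_box` is a hypothetical helper, and the element names are the ones visible in the sample.

```python
import xml.etree.ElementTree as ET
from typing import Dict

BOX_FIELDS = ("north", "south", "east", "west", "rotation")


def lat_lon_box(kml: str) -> Dict[str, float]:
    """Extract the LatLonBox values from a KML GroundOverlay document."""
    root = ET.fromstring(kml)
    box: Dict[str, float] = {}
    for element in root.iter():
        # Tags look like "{namespace}north"; compare the local name only,
        # so the sketch works regardless of the KML namespace version.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in BOX_FIELDS and element.text:
            box[local_name] = float(element.text)
    return box
```

Applied to the literal above, the function would return a dictionary with the five values of the `<LatLonBox>` element, keyed by element name.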

Page 18: DBpedia ♥ Commons

Remaining TODOs

● Nested templates are commonly used and cannot be handled by the mappings wiki at the moment

– e.g. Media descriptions (although mapped) are missing

{{Information |Description= {{en|Logo of the [[w:en:DBpedia|DBpedia project]]}} {{fr|Logo du projet [[w:fr:DBpedia|DBpedia]]}}

● Annotation descriptions need some tweaking

– Need to render wikitext

● Put it under a SPARQL Endpoint

● Provide Linked Data

– http://commons.dbpedia.org
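
The nested-template item in the list above can be illustrated with a small brace-balancing scanner. This Python sketch is not how the mappings wiki works and is only one possible approach. It assumes language templates are named by a two- or three-letter lowercase code, and `language_descriptions` is a hypothetical helper. It returns raw wikitext, so links would still need rendering.

```python
import re
from typing import Dict

# Opening of a language template such as "{{en|" or "{{fr|" (assumed naming).
LANG_OPEN = re.compile(r"\{\{([a-z]{2,3})\|")


def language_descriptions(wikitext: str) -> Dict[str, str]:
    """Map language codes to the raw wikitext inside {{xx|...}} templates."""
    result: Dict[str, str] = {}
    for match in LANG_OPEN.finditer(wikitext):
        depth = 1
        i = match.end()
        # Walk forward, tracking {{ and }} until this template is closed.
        while i < len(wikitext) and depth > 0:
            if wikitext.startswith("{{", i):
                depth += 1
                i += 2
            elif wikitext.startswith("}}", i):
                depth -= 1
                i += 2
            else:
                i += 1
        if depth == 0:
            result[match.group(1)] = wikitext[match.end(): i - 2].strip()
    return result
```

Run on the `{{Information ...}}` snippet from the slide, it would yield one entry for `en` and one for `fr`, each still containing its `[[...]]` link markup.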

Page 19: DBpedia ♥ Commons

Thank You!

Special thanks to:

● Alexandru Todor (importing the License templates)

● Google Summer of Code for sponsoring this project (Gaurav Vaidya)

Questions?

Dataset: http://nl.dbpedia.org/downloads/commonswiki

Dataset samples: https://github.com/gaurav/commons-extraction