The Web Data Commons Microdata, RDFa, and Microformat Dataset Series @ ISWC2014

The WebDataCommons Microdata, RDFa, and Microformat Dataset Series
Robert Meusel, Petar Petrovski, and Christian Bizer

Post on 03-Jul-2015


The WebDataCommons Microdata, RDFa, and Microformat Dataset Series

Robert Meusel, Petar Petrovski, and Christian Bizer

HTML-embedded Structured Data on the Web

More and more websites semantically mark up the content of their HTML pages.

RDFa

Microdata

Microformats

The WebDataCommons Microdata, RDFa, and Microformats Dataset Series

Dataset Creation

Common Crawl Foundation corpora of 2010, 2012 and 2013
• Snapshot of popular pages of the Web
• New crawls become available continuously

Parsing the HTML pages using Apache Any23
• Using a distributed framework on 100 parallel EC2 instances

_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fu\u00DFballschuh"@de .
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node1 <http://schema.org/Offer/price> "\u20AC 219,95"@de .
_:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
…

Any23

The framework is easy to adapt and is publicly available at: http://webdatacommons.org/framework/
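For readers who want to process the extracted statements, the following is a minimal Python sketch of splitting one statement line into its terms. It assumes the dataset files follow the line layout of the example statements on this slide, optionally followed by a fourth term holding the URL of the page the statement was extracted from. The function name `parse_quad`, the simplified regular expression, and the `example.com` URL are illustrative assumptions and are not part of the WebDataCommons framework or of Any23.

```python
import re

# Simplified pattern for one RDF term: an IRI, a blank node, or a literal
# with an optional language tag or datatype. This is an assumption-laden
# shortcut, not a complete N-Quads grammar.
_TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'

# Subject, predicate, object, an optional fourth term (graph), then " ."
_QUAD = re.compile(rf'^\s*{_TERM}\s+{_TERM}\s+{_TERM}(?:\s+{_TERM})?\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line does not match.

    graph is None when the line only carries three terms.
    """
    match = _QUAD.match(line)
    return match.groups() if match else None


if __name__ == "__main__":
    # Hypothetical input line: the statement from the slide plus an assumed page URL.
    sample = (
        r'_:node1 <http://schema.org/Product/name> '
        r'"Predator Instinct FG Fu\u00DFballschuh"@de <http://example.com/shop> .'
    )
    print(parse_quad(sample))
```

A regular expression keeps the sketch dependency-free; for production use, a full RDF parser would handle escaping and edge cases that this pattern ignores.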

Dataset Series Overview

Series contains three datasets from 2010, 2012 and 2013

Altogether over 30 billion RDF quads

Each dataset is further split into subsets, each containing the quads extracted for one particular markup language

Overview of 2013 dataset

Over 1.7 million domains using at least one markup language

Over 17 billion quads with over 4 billion records (typed entities)

hCard still most dominant among domains

Microdata contains the largest number of quads

Divergence in Class and Property Usage in 2013

A small number of classes and properties are used by a large number of domains

RDFa: 646k classes and 27k properties, but <1k classes and ~2k properties are used by at least two different domains

Microdata (MD): 15k classes and 170k properties, but ~1.2k classes and <13k properties are used by at least two different domains

Classes and properties used by only a single domain are mostly typos
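The "used by at least two different domains" filter can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' analysis code: it assumes statements are available as `(subject, predicate, object, graph)` tuples where `graph` is the URL of the source page, and it approximates a "domain" by the URL's host name. The names `domains_per_class` and `shared_classes` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def domains_per_class(quads):
    """Map each class IRI to the set of hosts whose pages use it."""
    hosts = defaultdict(set)
    for _subject, predicate, obj, graph in quads:
        if predicate == RDF_TYPE and graph:
            # Assumption: the host name stands in for the "domain".
            host = urlparse(graph.strip("<>")).hostname
            if host:
                hosts[obj].add(host)
    return hosts


def shared_classes(quads, min_domains=2):
    """Return the classes used by at least min_domains different hosts."""
    return {
        cls
        for cls, hosts in domains_per_class(quads).items()
        if len(hosts) >= min_domains
    }


if __name__ == "__main__":
    # Made-up statements: one class shared by two hosts, one typo on a single host.
    quads = [
        ("_:a", RDF_TYPE, "<http://schema.org/Product>", "<http://a.example/p1>"),
        ("_:b", RDF_TYPE, "<http://schema.org/Product>", "<http://b.example/p2>"),
        ("_:c", RDF_TYPE, "<http://schema.org/Prodcut>", "<http://a.example/p3>"),
    ]
    print(shared_classes(quads))
```

The same grouping works for properties by collecting the predicate instead of the object of `rdf:type` statements.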

RDFa Insights 2013

Usage of various vocabularies to describe information:
• Strong presence of the Open Graph Protocol (e.g. Facebook)
• FOAF and SIOC (blog software such as Drupal)

Largest topics covered are:
• Articles and Documents (blogs and news portals)
• Products, Reviews and Ratings
• Organizations

Microdata Insights 2013 and 2012

Clear increase in deployment compared to 2012

Still two vocabularies deployed: data-vocabulary and schema.org

Largest topical areas:
• Postal Addresses and Locations
• Products, Offers and Ratings
• Organizations and Persons
• Articles and Blogs
• Breadcrumb

Focus on Schema.org/Product

One of the largest publicly available product collections

Almost 100 million records described with name, offer and image

34 million records contain a further description

11% of all product records include a brand
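Figures such as the share of product records that include a brand can be recomputed along the following lines. This is a rough Python sketch under assumptions, not the method used for the numbers on this slide: it assumes `(subject, predicate, object, graph)` tuples, treats a record as one subject within one page (blank node labels are assumed to be unique only per page), and uses property IRIs in the style of the example statements earlier in the deck. The name `property_coverage` is hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
PRODUCT = "<http://schema.org/Product>"


def property_coverage(quads, cls=PRODUCT):
    """Return {property: fraction of records of type cls that use it}."""
    properties = defaultdict(set)  # (graph, subject) -> set of properties
    typed = set()                  # records whose rdf:type is cls
    for subject, predicate, obj, graph in quads:
        key = (graph, subject)
        if predicate == RDF_TYPE:
            if obj == cls:
                typed.add(key)
        else:
            properties[key].add(predicate)
    if not typed:
        return {}
    counts = defaultdict(int)
    for key in typed:
        for prop in properties.get(key, ()):
            counts[prop] += 1
    return {prop: count / len(typed) for prop, count in counts.items()}


if __name__ == "__main__":
    # Made-up records: two products, only one of which states a brand.
    name = "<http://schema.org/Product/name>"
    brand = "<http://schema.org/Product/brand>"
    quads = [
        ("_:n1", RDF_TYPE, PRODUCT, "<http://a.example/p1>"),
        ("_:n1", name, '"Shoe"', "<http://a.example/p1>"),
        ("_:n1", brand, '"Acme"', "<http://a.example/p1>"),
        ("_:n1", RDF_TYPE, PRODUCT, "<http://b.example/p2>"),
        ("_:n1", name, '"Boot"', "<http://b.example/p2>"),
    ]
    print(property_coverage(quads))
```

Keying records by page and subject together keeps identically labelled blank nodes from different pages apart, which matters because the same label (`_:n1` here) can recur across pages.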

Microformats Insights 2013

Most dominant vocabulary is hCard

Still a very solid deployment

Topics are:
• Persons & Organizations
• Events
• Products and reviews
• Recipes

Opportunities & Challenges

Opportunities

Vast amounts of free data, created by people all over the world

Large topical coverage, from broad areas (such as products) to niche ones (such as recipes)

Highly up-to-date information, as popular pages are likely to update their content frequently

Challenges

Data quality assessment, as the data is created by experts and novices alike

Further information extraction, as a flat schema and a rather low number of properties are used

Identity resolution, as the data hardly contains any identifiers

Possible Application Domains

Enriching existing knowledge bases
• E.g. mapping DBpedia classes and properties to the corresponding classes and properties within the available vocabularies, to add missing information and extend entity knowledge
• As shown by Lehmberg et al., winners of the Semantic Web Challenge (Big Data Track) 2014, this data can be used as an additional source (among others) to gather and return broader search results

Design and adaptation of algorithms and methods to address the characteristics of such web data
• Training of data extraction methods to gather data that is not marked up within the HTML pages
• Further extraction of additional information from the raw data, e.g. extraction of skills, requirements etc. from job posting descriptions

Starting point for further data discovery
• The dataset can be used as a starting point for further data crawling, as in most cases not all pages from a domain are included

Thank you! Questions? Feedback?

Acknowledgement

The extraction and analysis of the datasets were supported by an AWS in Education Grant and the EU FP7 project LOD2. Special thanks to SWSA for supporting the travel to ISWC 2014.

Data and more statistics can be found at: http://webdatacommons.org/structureddata/index.html

More interesting datasets and analysis can be found at the website of WebDataCommons:

http://webdatacommons.org/index.html
