Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis

Page 1: Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis

www.sti-innsbruck.at © Copyright 2008 STI INNSBRUCK

Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis

OC Working Group – 21.01.2014

Serge Tymaniuk

Overview

• Introduction

• Methodology

• Results

• Questions

Introduction

• Written by Christian Bizer (1), Kai Eckert (1), Robert Meusel (1), Hannes Mühleisen (2), Michael Schuhmacher (1), and Johanna Völker (1)

– (1) Data and Web Science Group, University of Mannheim, Germany

– (2) Database Architectures Group, Centrum Wiskunde & Informatica, Netherlands

• Features:

– Analysis of RDFa, Microdata, and Microformats adoption on the Web

– Based on a large public Web crawl of 3 billion HTML pages

– Aims at revealing the main topical areas of the published data and the different vocabularies used within each topical area

– Examines structural richness (which properties are used to describe popular types of entities)

Web Crawl

• Web crawl provided by the Common Crawl Foundation, available as ARC files from Amazon S3.

• 3,005,626,093 unique HTML pages from 40.6 million pay-level domains.

• Crawling conducted between January and June 2012

• Compressed size of the corpus is 48 TB

• The crawler relies on the PageRank algorithm to decide which pages to retrieve

Data Extraction Process

• Parsing framework is executed on Amazon EC2

• Relies on the Anything To Triples (Any23, http://any23.apache.org/) parsing library from Apache

• The RapidMiner data mining framework is used for vocabulary term co-occurrence analyses
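
As a rough illustration of what a structured-data extractor does, the following Python sketch collects Microdata `itemtype` and `itemprop` values from an HTML snippet using only the standard library. It is not the Any23 pipeline used in the study; the class name `MicrodataScanner` and the sample HTML are assumptions made for this example, and nested items, RDFa, and Microformats are not handled.

```python
from html.parser import HTMLParser


class MicrodataScanner(HTMLParser):
    """Hypothetical helper: records Microdata types and property names."""

    def __init__(self):
        super().__init__()
        self.item_types = []  # values of itemtype attributes, in document order
        self.item_props = []  # values of itemprop attributes, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; valueless attributes
        # such as itemscope arrive with value None.
        attr_map = dict(attrs)
        if attr_map.get("itemtype"):
            self.item_types.append(attr_map["itemtype"])
        if attr_map.get("itemprop"):
            self.item_props.append(attr_map["itemprop"])


# Illustrative input only; not taken from the crawl.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example lamp</span>
  <span itemprop="description">A desk lamp.</span>
</div>
"""

scanner = MicrodataScanner()
scanner.feed(SAMPLE_HTML)
print(scanner.item_types)
print(scanner.item_props)
```

A production extractor would additionally resolve relative URLs, track item nesting, and emit RDF triples, which is the part a library such as Any23 takes care of.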

Results: Overall picture

• Structured data was discovered within 369M out of 3B pages contained in the Common Crawl corpus (12.3%), and within 2.29M out of 40.6M domains (5.64%)
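
The two percentages can be re-derived from the counts quoted on the slides. The sketch below assumes the rounded figures 369M and 2.29M for pages and domains with structured data, together with the exact page count and the rounded domain count from the Web Crawl slide; the function name `share_percent` is made up for this example.

```python
def share_percent(part, total, digits):
    """Return part/total as a percentage, rounded to the given digits."""
    return round(100.0 * part / total, digits)


# Counts as quoted on the slides (369M and 2.29M are rounded values).
pages_with_data = 369_000_000
total_pages = 3_005_626_093
domains_with_data = 2_290_000
total_domains = 40_600_000

page_share = share_percent(pages_with_data, total_pages, 1)
domain_share = share_percent(domains_with_data, total_domains, 2)

print(f"pages with structured data:   {page_share}%")
print(f"domains with structured data: {domain_share}%")
```

The gap between the page share and the domain share suggests that the sites which do publish structured data contribute a disproportionately large number of pages to the corpus.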

Results: Deployment by FORMAT

* PLDs – pay-level domains (i.e. websites)

* URLs – HTML pages

Results: Deployment by POPULARITY

* According to Alexa Internet Inc. (AL) list of the most frequently visited websites

Results: Deployment by domains

Results: Deployment on the same Website

• 93.5% of all websites that contain structured data use only a single format
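
A statistic of this kind can be computed from a mapping of websites to the set of markup formats found on them. The following is a minimal sketch under that assumption; the function name `single_format_share` and the toy data are hypothetical and do not reproduce the study's numbers.

```python
def single_format_share(formats_by_site):
    """Percentage of sites with structured data that use exactly one format.

    formats_by_site maps a site name to the set of formats detected on it.
    Sites with an empty set (no structured data) are excluded from the base.
    """
    sites_with_data = [f for f in formats_by_site.values() if f]
    if not sites_with_data:
        return 0.0
    single = sum(1 for f in sites_with_data if len(f) == 1)
    return 100.0 * single / len(sites_with_data)


# Toy input for illustration only.
toy_sites = {
    "a.example": {"RDFa"},
    "b.example": {"Microdata"},
    "c.example": {"Microformats", "RDFa"},
    "d.example": set(),  # no structured data, so not counted
    "e.example": {"Microformats"},
}

share = single_format_share(toy_sites)
print(f"{share:.1f}% of sites with structured data use a single format")
```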

Results: Deployment of RDFa

Most frequently used RDFa classes:

Alexa top 100 websites that use RDFa:

• IMDB

• Microsoft News Portal

• BBC

Most frequently used properties co-occurring with all the 4 most frequently used OGP classes:

Results: Deployment of Microdata

Most frequently used Microdata classes:

Alexa top 100 websites that use Microdata:

• eBay

• Microsoft Corp.

• Apple Inc.

Results: Deployment of Microformats

Most frequently used Microformats classes:

Alexa top 100 websites that use Microformats:

• Wikipedia

• Adobe

• Taobao marketplace

Results: Topical Domains

• Dominant Domains of the published data:

– Persons and Organizations (by all 3 formats)

– Blog- and CMS-related metadata (by RDFa and Microdata)

– Navigational metadata (by RDFa and Microdata)

– Product data (by all 3 formats)

– Event data (by Microformats)

Results: Structural Richness

• Only a small set of generic properties is used to describe entities:

– Instances of the OGP class “Product” are described by title, url, site_name, and description in most cases

– Instances of the schema.org class “Product” are described largely only by name and description.

Additional extraction techniques have to be employed for a deeper understanding
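
Structural richness can be quantified as the fraction of entities of a given class that carry each property. The sketch below assumes extracted entities are available as dictionaries with a `type` and a list of `properties`; that record layout, the function name `property_coverage`, and the toy data are assumptions for illustration, not the study's data model.

```python
from collections import Counter


def property_coverage(entities, class_name):
    """Map each property to the fraction of class_name entities using it."""
    matching = [e for e in entities if e["type"] == class_name]
    if not matching:
        return {}
    counts = Counter()
    for entity in matching:
        # Use a set so a property repeated on one entity counts once.
        counts.update(set(entity["properties"]))
    return {prop: n / len(matching) for prop, n in counts.items()}


# Toy entities for illustration only.
toy_entities = [
    {"type": "Product", "properties": ["name", "description"]},
    {"type": "Product", "properties": ["name", "description", "price"]},
    {"type": "Product", "properties": ["name", "description"]},
    {"type": "Product", "properties": ["name"]},
    {"type": "Person", "properties": ["name", "url"]},
]

coverage = property_coverage(toy_entities, "Product")
for prop, fraction in sorted(coverage.items(), key=lambda kv: -kv[1]):
    print(f"{prop}: {fraction:.0%}")
```

A steep drop after the first few properties, as in the toy data, is the pattern the slide describes: a handful of generic properties dominate, while more specific ones are rare.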

Sources

1. Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, and Johanna Völker (2013). Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. Retrieved from: http://hannes.muehleisen.org/Bizer-etal-DeploymentRDFaMicrodataMicroformats-ISWC-InUse-2013.pdf

Thank you for your attention!

Questions?