Heuristics for Fixing Common Errors in Deployed schema.org Microdata

Robert Meusel and Heiko Paulheim

Motivation

Heuristics for Fixing Common Errors in Deployed schema.org Microdata - Meusel and Paulheim - ESWC 2015

Microdata in a Nutshell

- Adding structured information to web pages

• By marking up contents and entities

- Arbitrary vocabularies are possible

• Practically, only schema.org is deployed on a large scale

• Plus its historical predecessor: data-vocabulary.org

- Similar to RDFa

<div itemscope itemtype="http://schema.org/PostalAddress">
  <span itemprop="name">Data and Web Science Group</span>
  <span itemprop="addressLocality">Mannheim</span>,
  <span itemprop="postalCode">68131</span>
  <span itemprop="addressCountry">Germany</span>
</div>

Schema.org in a Nutshell

- Vocabulary for marking up entities on web pages

• 675 classes and 965 properties (as of May 2015, release 2.0)

- Promoted and consumed by major search engine companies

• Google, Bing, Yahoo!, and Yandex

• Google Rich Snippets

- Community-driven evolution and development

- Can be used with Microdata and RDFa

• Hardly used together with RDFa (<0.1% of RDFa-using websites [1])

[1] http://webdatacommons.org/structureddata/2014-12/stats/stats.html

Schema.org in a Nutshell – Coverage

- Schema.org has incorporated some popular vocabularies, like:

• Good Relations (2012)

• W3C BibExtend (2014)

• MusicBrainz vocabulary (2015)

• Automotive Ontology (2015)

Microdata with Schema.org in HTML Pages

<html>…<body>…
<div id="main-section" class="performance left" data-sku="M17242_580">
  <h1> Predator Instinct FG Fußballschuh </h1>
  <div>
    <meta content="EUR">
    <span data-sale-price="219.95">219,95</span>
…</body></html>

HTML pages directly embed markup languages to annotate items using different vocabularies

<html>…<body>…
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price" data-sale-price="219.95">219,95</span>
…</body></html>

_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node2 <http://schema.org/Offer/price> "219,95"@de .
_:node2 <http://schema.org/Offer/priceCurrency> "EUR" .
…

So Far, So Good …

- The schema is well explained on the schema.org website

- Data providers are supported by validation tools (e.g. Yandex structured data validator) when deploying

- Win-Win for both sides

- Plus: Data is (mostly) freely accessible on the Web

… but:

- Hundreds of thousands of data providers, most of whom are neither schema.org experts nor evangelists

- Validators and the schema documentation might help, but nobody is required to use them

So What Could Possibly Go Wrong?

- Usage of wrong namespaces

• http./schema.org

- Usage of undefined types

• http://schema.org/Breadcrumb

- Usage of undefined properties

• http://schema.org/postID

- Confusion of datatype properties and object properties

• _:n1 s:address "Jump Street 21"

- Property domain and range violations

• _:n1 a s:Product
  _:n1 s:price "for free"

Compiling a Schema.org Dataset

- Starting point: all pages in the CommonCrawl that contain Microdata

- What could be (meant to be) schema.org?

• Everything that contains "schema.org" as a substring in a namespace

• Everything that contains URIs whose protocol and authority are similar to "http://schema.org/" (edit distance of 1)

• Filter noise: remove all namespaces that occur on only one website

Final corpus consists of 6.4 billion triples extracted from over 217 billion pages belonging to 398,542 data providers, which is 86% of all Microdata in the corpus.
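The selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used for the corpus: the function names, the prefix-based reading of "protocol and authority", and the (website, namespace) input format are hypothetical.

```python
from collections import defaultdict

TARGET = "http://schema.org/"


def edit_distance(a, b):
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def looks_like_schema_org(uri):
    """Rule 1: substring match. Rule 2: some prefix of the URI is within
    edit distance 1 of the target (assumed reading of the second rule)."""
    if "schema.org" in uri:
        return True
    lengths = (len(TARGET) - 1, len(TARGET), len(TARGET) + 1)
    return any(edit_distance(uri[:n], TARGET) <= 1 for n in lengths)


def candidate_namespaces(observations):
    """observations: iterable of (website, namespace) pairs (assumed format).
    Namespaces seen on only one website are dropped as noise."""
    sites = defaultdict(set)
    for website, namespace in observations:
        if looks_like_schema_org(namespace):
            sites[namespace].add(website)
    return {ns for ns, seen_on in sites.items() if len(seen_on) > 1}
```

The substring rule is checked first because it is cheap; the edit-distance rule only matters for namespaces such as http://shema.org/ where the string "schema.org" itself is misspelled.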

Namespace Violations

- More than 98% of the preselected pages use a correct namespace

- Frequent namespace variations:

• http://www.schema.org/

• https://schema.org

• http:/schema.org

• http://SChema.org

Debated!

Undefined Types

- Used by around 6% of all data providers

- Typical causes:

• Misspellings: http://schema.org/Stores (correct: …/Store)

• Miscapitalization: http://schema.org/localbusiness (correct: …/LocalBusiness)

- Comparison to LOD Compliance

• 5.8% of all Microdata documents

• 38.8% of all LOD documents (Hogan et al., 2010)

Undefined Properties

- Used by around 4% of all data providers

- Typical causes:

• Miscapitalization: http://schema.org/contentURL (correct: …/contentUrl)

• Near misses: http://schema.org/currency (correct: …/priceCurrency), http://schema.org/fax

• Made up: http://schema.org/blogId, http://schema.org/postId

- Comparison to LOD Compliance

• 9.7% of all Microdata documents

• 72.4% of all LOD documents (Hogan et al., 2010)

Confusion of Object Properties with Data Properties

- i.e. using an object property with a string value

- Used by over 56.6% of all data providers

- Typical properties:

• http://schema.org/addressCountry

• http://schema.org/manufacturer

• http://schema.org/author

• http://schema.org/brand

- Comparison to LOD Compliance

• 24.35% of all Microdata documents

• 8% of all LOD documents (Hogan et al., 2010)

Confusion of Data Properties with Object Properties

- i.e. using a data property with a complex object

- Used by less than 0.2% of all data providers

- Comparison to LOD Compliance

• 0.6% of all Microdata documents

• 2.2% of all LOD documents (Hogan et al., 2010)

Property Domain Violations

- i.e. using a property with a subject not included in its domain

- Used by 4% of all data providers

- Typical violations are mainly shortcuts

• s:price used on s:Product (intended path: s:Product → s:Offer → s:price)

• s:streetAddress used on s:LocalBusiness (intended path: s:LocalBusiness → s:PostalAddress → s:streetAddress)

- Comparison to LOD Compliance:

• Difficult to compare as semantics are different

• List of schema.org domains is exhaustive

• LOD: open world assumption

Data Property Range Violations

- i.e. using a data property with an incompatible literal

- Used by 9.6% of all data providers

- 20 most common violations:

• 13 dates

• 3 URLs

• 2 numbers

• 2 times

- Comparison to LOD Compliance:

• 12.06% of all Microdata documents

• 4.6% of all LOD documents (Hogan et al., 2010)

Examples of incompatible literals: "a month ago", "2 pieces", "last week"
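Since most of the common violations concern dates, a range check for date-valued properties is easy to sketch. The snippet below is a minimal illustration, assuming that a date property should hold an ISO 8601 calendar date; the function name is hypothetical and this is not the check used in the study.

```python
from datetime import date


def is_valid_date_literal(value):
    """Return True if the literal parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value.strip())
        return True
    except ValueError:
        return False
```

Literals such as "a month ago" or "last week" fail this check, while "2015-07-28" passes.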

Object Property Range Violations

- i.e. using an object property with a type outside its range

- Used by 8.6% of all data providers

- Typical violations:

• s:mainContentOfPage with s:Blog instead of s:WebPageElement

- Comparison to LOD Compliance

• 3.2% of all Microdata documents

• 2.4% of all LOD documents (Hogan et al., 2010)

Maybe a hint at a missing hierarchy relation?

Schema.org Compliance Summary

- Surprisingly high level of compliance

- Providers are often not technology evangelists (unlike in LOD)

• Anybody can start publishing Microdata-annotated HTML

- Compliance is mostly higher than for LOD

• Except for the confusion of data and object properties

Still, the number of erroneous pages could prevent data consumers from making use of the annotated data and understanding its semantics.

Identifying and Fixing Wrong Namespaces

- Main errors are due to missing slashes, wrong protocol, and wrong capitalization

- Simple rules to handle wrong namespaces

• Removal of www

• Replacement of https by http

• Conversion to lower case

• Adding of missing slashes and removal of prefixes before schema.org

- Impact:

• 147 of 148 wrongly spelled namespaces could be fixed
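As a rough sketch, the namespace rules can be collapsed into one regular expression that rewrites everything up to and including "schema.org" and any following slashes into the canonical namespace. The function name and the single-regex formulation are assumptions made for illustration, not the authors' implementation.

```python
import re

# Everything up to the first "schema.org" (any capitalization), plus the
# slashes that follow it, is treated as the namespace part of the URI.
_NAMESPACE = re.compile(r"^.*?schema\.org/*", re.IGNORECASE)


def fix_namespace(uri):
    """Rewrite the namespace part to http://schema.org/ and keep the
    local name (type or property) untouched. Other URIs are returned as-is."""
    fixed, replaced = _NAMESPACE.subn("http://schema.org/", uri, count=1)
    return fixed if replaced else uri
```

Only the namespace part is lowercased here; the local name keeps its capitalization. Note that this sketch also rewrites any other host that merely ends in "schema.org", which a production version would have to guard against.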

Handling Undefined Types and Properties

- Main errors are due to wrong capitalization

- Heuristic: Ignore capitalization when parsing entities from web pages, and replace the schema element with the properly capitalized version

- Impact (together with namespace fixes):

• Correct type replacement within 71% of all data providers

• Correct property replacement within 65% of all data providers

• Remaining data providers account for over 70% of all undefined types and properties; their errors are hard-to-detect typos
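The capitalization heuristic amounts to a case-insensitive lookup against the list of defined schema.org terms. The sketch below uses a tiny hand-picked excerpt of that list; the names SCHEMA_TERMS and fix_capitalization are hypothetical, and a real run would load the full vocabulary.

```python
# Small excerpt of defined schema.org types and properties (illustrative only).
SCHEMA_TERMS = ["LocalBusiness", "Store", "Product", "contentUrl", "priceCurrency"]

# Lowercased term -> properly capitalized term.
CANONICAL = {term.lower(): term for term in SCHEMA_TERMS}


def fix_capitalization(uri, namespace="http://schema.org/"):
    """Replace the local name by its properly capitalized version if the
    lowercased name is a known term; otherwise leave the URI unchanged."""
    if not uri.startswith(namespace):
        return uri
    local_name = uri[len(namespace):]
    canonical = CANONICAL.get(local_name.lower())
    return namespace + canonical if canonical else uri
```

Misspellings such as http://schema.org/Stores pass through unchanged, which matches the observation that the remaining errors are typos rather than capitalization problems.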

Handling Object Properties with Literal Values

- Main objects modeled as literals are s:Organization, s:Person and s:PostalAddress

- Manually inspecting those values for the object properties s:author, s:creator and s:address

- Impact

• The heuristic could replace all misused object properties on 92,449 data providers

• Might lead to changes in the type distribution

• E.g. 14 million new entities of type s:PostalAddress

Before:  _:1 s:author "Robert" .
After:   _:1 s:author _:2 .
         _:2 a s:Person .
         _:2 s:name "Robert" .
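The s:author example can be turned into a small triple rewrite. The following is a sketch under assumptions: triples are plain Python tuples, literals are wrapped in a hypothetical Literal class, and the property-to-type mapping only contains the one case shown on the slide (s:author to s:Person with the literal moved to s:name).

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """Marks an object value as a plain literal rather than a node."""
    value: str


# Assumed mapping; only the author -> Person case is taken from the slide.
ASSUMED_RANGE = {"s:author": "s:Person"}


def lift_literals(triples, assumed_range=ASSUMED_RANGE):
    """Replace literal objects of known object properties by a new typed
    node that carries the literal as its s:name."""
    fixed = []
    fresh = 0
    for subject, prop, obj in triples:
        if isinstance(obj, Literal) and prop in assumed_range:
            fresh += 1
            node = f"_:gen{fresh}"  # assumed not to clash with existing ids
            fixed.append((subject, prop, node))
            fixed.append((node, "a", assumed_range[prop]))
            fixed.append((node, "s:name", obj))
        else:
            fixed.append((subject, prop, obj))
    return fixed
```

Each rewritten literal adds two triples and one new entity, which is why applying the heuristic at scale shifts the type distribution.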

Handling Property Domain Violations

- The main cause is shortcuts

- Heuristic: for a domain violation of property s:r on an entity of type s:t, find the property R and the type T such that there is one unique solution for only one of the two patterns:

• Pattern 1: R s:domainIncludes s:t . R s:rangeIncludes T . s:r s:domainIncludes T .

• Pattern 2: R s:rangeIncludes s:t . R s:domainIncludes T . s:r s:domainIncludes T .

- Example: _:1 s:aggregateRating "5", where s:aggregateRating is not defined for the type of _:1. The value is moved to a second node _:2; its type (T) and the property (R) linking it to _:1 are the unknowns.

- Impact:

• 31% of erroneous data providers could be fixed

• No solution or multiple solutions for the rest
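The search for R and T over the two patterns can be sketched as follows. The schema is passed in as two dictionaries mapping each property to its domainIncludes and rangeIncludes sets; this representation, the function name, and the reading of "one unique solution" as "exactly one solution across both patterns" are assumptions.

```python
def find_fix(r, t, domain_includes, range_includes):
    """Return (pattern, R, T) if exactly one solution exists, else None.

    r: the property used in violation of its domain
    t: the type of the entity on which r was used
    """
    targets = domain_includes.get(r, set())  # types T on which r is defined
    solutions = []
    for R in sorted(domain_includes):
        for T in sorted(targets):
            # Pattern 1: t --R--> T, and r is defined on T
            if t in domain_includes[R] and T in range_includes.get(R, set()):
                solutions.append(("pattern1", R, T))
            # Pattern 2: T --R--> t, and r is defined on T
            if t in range_includes.get(R, set()) and T in domain_includes[R]:
                solutions.append(("pattern2", R, T))
    return solutions[0] if len(solutions) == 1 else None


# Heavily simplified schema excerpt, for illustration only.
domain = {"s:offers": {"s:Product"}, "s:price": {"s:Offer"}}
range_ = {"s:offers": {"s:Offer"}}
fix = find_fix("s:price", "s:Product", domain, range_)
```

With this excerpt, s:price used on s:Product resolves to pattern 1 with R = s:offers and T = s:Offer. With the full schema, many violations have no solution or several, in which case the function returns None and the violation is left alone.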

Heuristics Summary

- Over 410 million wrong triples could be corrected

- Over 700 million missing triples could be added

- Corrections affected over 115,000 data providers in total

• ~28% of all data providers in the data set

LD4IE Challenge @ ISWC 2015

Learn to annotate entities on HTML pages using already annotated pages as training set.

- Deadline: 2015-07-15

- Challenge Page: goo.gl/laF6yl

- Contact: Heiko Paulheim

Good Luck!

Thank you! Questions? Feedback?

Data and more insights can be found at:

http://webdatacommons.org/structureddata/2013-11/stats/fixing_common_errors.html

More interesting datasets and analysis can be found at the website of WebDataCommons:

http://webdatacommons.org/index.html

Acknowledgement

The extraction and analysis of the datasets were supported by an AWS in Education Grant.