Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014

Linked Data for Information Extraction Challenge 2014: Tasks and Results. Robert Meusel and Heiko Paulheim

Upload: robert-meusel

Post on 19-Jun-2015






The Linked Data for Information Extraction challenge aims at extracting structured data from Web pages. It is based on a subset of the Web Data Commons Microformats dataset. For the challenge, the original annotated pages are provided, as well as the triples extracted from them. Based on that information, participants have to design an information extraction system that extracts the same kind of information from other web pages. In this year's challenge, we focus on hCard data, i.e., information about persons. A use case for such a system could be the assembly of a large database of person data. The systems are evaluated on a test set of annotated web pages from which all annotations have been removed. The participants have to extract triples from those pages and send in their resulting triple files. The submitted files are evaluated against the gold standard of the original triples, and the solutions are ranked by F-measure.


Page 1: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014

Linked Data for Information Extraction Challenge 2014

Tasks and Results

Robert Meusel and Heiko Paulheim

Page 2: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014



- The training dataset was created from HTML pages that are annotated using Microformats hCard.

- The data is a subset of the WebDataCommons Microformats Dataset.

- The original data is provided by the Common Crawl Foundation, the largest publicly available collection of web crawls.

Creation of an information extraction system that scrapes structured information from HTML web pages.

Page 3: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014


The Common Crawl Foundation (CC)

- Non-profit foundation dedicated to building and maintaining an open crawl of the Web

- 9 crawl corpora from 2008 to 2014 are available so far

- Crawling strategies:

• Earlier crawls used BFS (with link discovery), seeded with a large list of ranked seeds (PageRank); current crawls are gathered using a seed list of more than 6 billion URLs from the blekko search index

• By this, all crawls represent the popular part of the Web

- Data availability:

• CC provides three different datasets for each crawl

• All data can be freely downloaded from AWS S3

Page 4: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014


The WebDataCommons Project

- Extracts information annotated with the markup languages Microformats, Microdata, and RDFa

- So far, three different datasets have been gathered from the crawls of 2010, 2012, and 2013

Extraction of Structured Data from the Common Crawl Corpora




Page 5: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014


Extracting the Data

- Webmasters mark up their information directly within the HTML page, using one of the three markup languages

- Using Any23, this information is extracted as RDF triples

_:node1 <> <> .

_:node1 <> "Predator Instinct FG Fu\u00DFballschuh"@de .

_:node1 <> <> .

_:node1 <> "\u20AC 219,95"@de .

_:node1 <> "EUR"@de .

…
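To make the idea of turning hCard markup into triples more concrete, here is a minimal sketch in Python. It is not Any23's implementation: the class name HCardSketch, the function extract, and the prefixed predicate name hcard:fn are assumptions made for illustration, and only the vcard and fn classes are handled. It also assumes reasonably well-formed HTML, in which every non-void element has a closing tag.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HCardSketch(HTMLParser):
    """Hypothetical, heavily simplified hCard extractor (vcard and fn only)."""

    def __init__(self):
        super().__init__()
        self.triples = []    # collected (subject, predicate, object) tuples
        self.open_tags = []  # role of each open element: "vcard", "fn" or None
        self.cards = 0       # number of vcard roots seen so far
        self.fn_text = []    # text fragments of the fn element being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        role = None
        if "vcard" in classes:
            # Each vcard root becomes a new blank node with a type statement.
            self.cards += 1
            self.triples.append((f"_:node{self.cards}", "rdf:type", "hcard:Vcard"))
            role = "vcard"
        elif "fn" in classes and "vcard" in self.open_tags:
            role = "fn"
            self.fn_text = []
        self.open_tags.append(role)

    def handle_data(self, data):
        # Collect text only while an fn element is open.
        if "fn" in self.open_tags:
            self.fn_text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self.open_tags:
            return
        role = self.open_tags.pop()
        if role == "fn":
            name = "".join(self.fn_text).strip()
            # "hcard:fn" is an assumed predicate name, used for illustration.
            self.triples.append((f"_:node{self.cards}", "hcard:fn", name))


def extract(html):
    """Return the triples found in one HTML page."""
    parser = HCardSketch()
    parser.feed(html)
    return parser.triples


if __name__ == "__main__":
    page = ('<div class="vcard"><span class="fn">Jane <b>Doe</b></span>'
            '<br><span class="org">ACME</span></div>')
    for triple in extract(page):
        print(triple)
```

A real extractor covers many more properties, nested cards, and malformed markup; the sketch only shows how class attributes map to subject, predicate, and object.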


Page 6: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014


The Original Dataset of 2013

- Over 1.7 million domains using at least one markup language

- Over 17 billion quads with over 4 billion records (typed entities)

- hCard is the most dominant format among domains

Page 7: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014


Extraction of Challenge Dataset

- Selected a subset of over 10k web pages from the corpus, including over 450k extracted triples (annotated with MF hCard)

• Training: 9 877 web pages / 373 501 triples

• Test: 2 379 web pages / 85 248 triples

Page 8: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014


Creation of the Gold Standard

- Input: Annotated HTML Pages & Triples (extracted with Any23)

- After extraction of the triples, all hCard tags are replaced

• Replacement by randomly generated tags

• Stable per page, but different across pages

• Replacement of comments, since CMSs like to add comments such as <!-- here is the name of the company -->

- Output

• Training: annotated HTML page, cleaned HTML page, triples

• Testing: cleaned HTML page, triples (not public)
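The cleaning step described above can be sketched as follows. This is an illustration under stated assumptions, not the organizers' actual script: the function name clean_page, the eight-letter random tags, and the short list of hCard class names are all hypothetical, and comments are simply dropped here rather than replaced.

```python
import random
import re
import string

# Assumed subset of hCard class names; the full list used for the challenge
# is not given in the slides.
HCARD_CLASSES = {"vcard", "fn", "n", "org", "adr", "tel", "email", "url"}


def clean_page(html, rng=None):
    """Replace hCard class names by random tags and drop HTML comments.

    The mapping is created per call, so a tag is stable within one page
    but differs across pages.
    """
    rng = rng or random.Random()
    mapping = {}  # hCard class name -> random tag for this page

    def tag_for(name):
        if name not in mapping:
            mapping[name] = "".join(
                rng.choice(string.ascii_lowercase) for _ in range(8))
        return mapping[name]

    def replace_classes(match):
        quote, value = match.group(1), match.group(2)
        names = [tag_for(n) if n in HCARD_CLASSES else n
                 for n in value.split()]
        return f"class={quote}{' '.join(names)}{quote}"

    # Comments may leak hints such as "here is the name of the company".
    without_comments = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    cleaned = re.sub(r"class=([\"'])(.*?)\1", replace_classes,
                     without_comments, flags=re.S)
    return cleaned, mapping


if __name__ == "__main__":
    page = ('<!-- here is the name of the company -->'
            '<div class="vcard box"><span class="fn">ACME</span></div>')
    cleaned, mapping = clean_page(page)
    print(cleaned)
    print(mapping)
```

Keeping the mapping per page means a learner cannot memorize one global replacement token, while the structure within a single page stays consistent.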

Page 9: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014


Overview: Dataset Creation and Evaluation Process

Page 10: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014



- Methodology: We consider a triple from the submission and a triple from the gold standard (extracted with Any23 from the original test HTML pages) as equal if they have the same predicate and object for the same page.

- Baseline: Each page is assumed to contain at least one statement declaring that there is one VCard

_:1 rdf:type hcard:Vcard .
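The matching rule and the baseline can be sketched in a few lines of Python. The function names evaluate and baseline, the page identifiers, and the predicate hcard:fn in the example are assumptions for illustration. The sketch also assumes set semantics, i.e. duplicate statements on a page count once, which the slides do not specify.

```python
def evaluate(gold, submitted):
    """Compare statements as (page, predicate, object); subjects are ignored.

    Returns (precision, recall, f_measure).
    """
    gold_set, sub_set = set(gold), set(submitted)
    true_positives = len(gold_set & sub_set)
    precision = true_positives / len(sub_set) if sub_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def baseline(pages):
    """One type statement per page, as in the baseline described above."""
    return [(page, "rdf:type", "hcard:Vcard") for page in pages]


if __name__ == "__main__":
    # Tiny made-up gold standard with two pages.
    gold = [
        ("page1", "rdf:type", "hcard:Vcard"),
        ("page1", "hcard:fn", "Jane Doe"),
        ("page2", "rdf:type", "hcard:Vcard"),
        ("page2", "hcard:fn", "John Roe"),
    ]
    p, r, f = evaluate(gold, baseline(["page1", "page2"]))
    print(f"precision={p:.3f} recall={r:.3f} f={f:.3f}")
    # prints: precision=1.000 recall=0.500 f=0.667
```

On this toy data the baseline gets every statement it emits right but finds only half of the gold statements, which shows why a baseline of this kind can be strong on precision and weak on recall.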

Page 11: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014


Challenge Results

- We received one submission (which you will learn about in a few minutes)

- The submission outperforms the baseline for Recall and F-Measure

- The gold standard is not perfect: within the data, we also find names and other attributes without a given type (whenever webmasters did not model this). Hence, even a perfect extraction system would not reach a precision of 1.

Page 12: Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014


Outlook: LD4IE Challenge 2015

- Include more classes (e.g. Microdata and/or RDFa)

- Add negative examples to generate a more realistic setting

• As of today, systems can assume there is something to extract within each test sample

• Challenge: making sure that the negative examples do not contain unmarked data

- Improve the representativeness of the challenge dataset

• Widespread CMSs automatically allow marking up articles, posts, etc.

• Eliminate such bias, if present, in future challenges










