Common Crawl: An Open Repository of Web Data


DESCRIPTION

Talk given by Lisa Green of the Common Crawl Foundation at the Hadoop User Group UK meetup in London on 10 October 2012.

TRANSCRIPT

Page 1: Common Crawl: An Open Repository of Web Data

What Does The Data World Mean to Society?

Lisa Green

1 October 2012

Common Crawl: An Open Repository of Web Data

Lisa Green

London HUG

10 October 2012

Page 2: Common Crawl: An Open Repository of Web Data

Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg

Page 3: Common Crawl: An Open Repository of Web Data

Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg

Page 4: Common Crawl: An Open Repository of Web Data

Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg

Page 5: Common Crawl: An Open Repository of Web Data

Still Nascent

• Even cheaper storage
• Even cheaper compute
• Education
• Open Data

Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)

Page 6: Common Crawl: An Open Repository of Web Data

Proprietary

Commercial

Gratis

Libre

Page 7: Common Crawl: An Open Repository of Web Data

Progress

Insight

Analysis

Data

Page 8: Common Crawl: An Open Repository of Web Data

Gil Elbaz

Page 9: Common Crawl: An Open Repository of Web Data
Page 10: Common Crawl: An Open Repository of Web Data

Common Crawl Data

• ~8 billion web pages
• ~120 TB
• 2008-2012
• ARC files, JSON metadata, text files
• Available to anyone

Page 11: Common Crawl: An Open Repository of Web Data

ARC Files - Raw Content

Metadata

• Status information
• HTTP response code
• File names & offsets of ARC files
• HTML title
• HTML meta tags
• RSS/Atom information
• All anchors/hyperlinks

Text Files - Text Only

http://commoncrawl.org/get-started
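The file names and offsets in the metadata suggest random access into the ARC files. The sketch below is a minimal illustration of that idea, not Common Crawl's own tooling. It assumes the ARC v1 record layout (a one-line header of URL, IP address, archive date, content type and length, followed by that many bytes of raw content) and assumes the bytes have already been decompressed. `ArcRecord`, `iter_arc_records` and `read_record_at` are hypothetical names.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Iterator


@dataclass
class ArcRecord:
    url: str
    ip_address: str
    archive_date: str
    content_type: str
    payload: bytes  # raw content, typically HTTP headers plus body


def iter_arc_records(stream: BinaryIO) -> Iterator[ArcRecord]:
    """Yield records from an uncompressed ARC v1 stream (assumed layout)."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of stream
        line = header.decode("utf-8", errors="replace").strip()
        if not line:
            continue  # blank separator line between records
        # Split from the right so a URL containing spaces stays in one piece.
        url, ip_address, archive_date, content_type, length = line.rsplit(" ", 4)
        payload = stream.read(int(length))
        yield ArcRecord(url, ip_address, archive_date, content_type, payload)


def read_record_at(stream: BinaryIO, offset: int) -> ArcRecord:
    """Seek to an offset taken from the metadata and read one record."""
    stream.seek(offset)
    return next(iter_arc_records(stream))
```

Under the same assumption, the leading `filedesc://` version block of an ARC file parses as an ordinary record, so a caller that only wants web pages can skip the first item.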

Page 12: Common Crawl: An Open Repository of Web Data
Page 13: Common Crawl: An Open Repository of Web Data

http://webdatacommons.org

Change between 2010 and 2012

• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%
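Figures like these come from checking each page for embedded-data markup. The sketch below is a rough heuristic only, and it is not the Web Data Commons extraction pipeline: it flags a page as Microdata or RDFa when any tag carries one of a few characteristic attributes. The attribute sets and the names `EmbeddedDataDetector` and `detect_embedded_data` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Characteristic attributes (a simplification, chosen for illustration).
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"typeof", "property", "about", "vocab", "resource", "prefix"}


class EmbeddedDataDetector(HTMLParser):
    """Records which embedded-data syntaxes appear in a page."""

    def __init__(self):
        super().__init__()
        self.formats = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before calling this method.
        names = {name for name, _ in attrs}
        if names & MICRODATA_ATTRS:
            self.formats.add("microdata")
        if names & RDFA_ATTRS:
            self.formats.add("rdfa")


def detect_embedded_data(html: str) -> set:
    """Return the set of syntaxes ("microdata", "rdfa") found in the HTML."""
    detector = EmbeddedDataDetector()
    detector.feed(html)
    return detector.formats
```

A real extractor would also validate vocabularies and parse the structured values; this only answers whether a page uses the markup at all.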

Page 14: Common Crawl: An Open Repository of Web Data

• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
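Shares of this kind can be estimated by scanning each page for two signals. The following is a minimal sketch under stated assumptions, not the method behind the figures on the slide: a page "contains a Facebook URL" if any `href` or `src` points at a facebook.com host, and "implements Open Graph" if it has a `<meta>` tag whose `property` starts with `og:`. `FacebookSignals` and `share_of_pages` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class FacebookSignals(HTMLParser):
    """Flags Open Graph meta tags and links to facebook.com in one page."""

    def __init__(self):
        super().__init__()
        self.has_open_graph = False
        self.has_facebook_url = False

    def handle_starttag(self, tag, attrs):
        values = dict(attrs)
        if tag == "meta" and (values.get("property") or "").startswith("og:"):
            self.has_open_graph = True
        for key in ("href", "src"):
            try:
                host = urlparse(values.get(key) or "").hostname or ""
            except ValueError:
                continue  # malformed URL, ignore it
            if host == "facebook.com" or host.endswith(".facebook.com"):
                self.has_facebook_url = True


def share_of_pages(pages):
    """Return (share with Open Graph tags, share with Facebook URLs)."""
    total = with_og = with_fb = 0
    for html in pages:
        signals = FacebookSignals()
        signals.feed(html)
        total += 1
        with_og += signals.has_open_graph
        with_fb += signals.has_facebook_url
    if total == 0:
        return 0.0, 0.0
    return with_og / total, with_fb / total
```

On a corpus of this size the per-page check would run inside a distributed job, with the two counters summed across workers.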

Page 15: Common Crawl: An Open Repository of Web Data

A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR.

Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell which terms people most commonly use to describe the concept.

Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.

http://wikientities.appspot.com
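A corpus of anchor text, Wikipedia concept and count can be pictured as a tally over every link that points at a Wikipedia article. The sketch below is one possible way to build such a tally and is not the project's actual code. It assumes a concept is the path component after `/wiki/` and that anchor text is lowercased with whitespace collapsed. `WikiAnchorCollector`, `count_anchor_concepts` and `top_terms` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class WikiAnchorCollector(HTMLParser):
    """Collects (anchor text, Wikipedia concept) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._concept = None  # concept of the link currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        try:
            parsed = urlparse(dict(attrs).get("href") or "")
        except ValueError:
            return  # malformed URL, ignore it
        host = parsed.hostname or ""
        if host.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            self._concept = unquote(parsed.path[len("/wiki/"):])
            self._text = []

    def handle_data(self, data):
        if self._concept is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._concept is not None:
            anchor = " ".join("".join(self._text).split()).lower()
            if anchor:
                self.pairs.append((anchor, self._concept))
            self._concept = None


def count_anchor_concepts(pages):
    """Count how often each (anchor text, concept) pair occurs."""
    counts = Counter()
    for html in pages:
        collector = WikiAnchorCollector()
        collector.feed(html)
        counts.update(collector.pairs)
    return counts


def top_terms(counts, concept, k=5):
    """Most common anchor texts used to describe one concept."""
    terms = Counter()
    for (anchor, linked_concept), n in counts.items():
        if linked_concept == concept:
            terms[anchor] += n
    return terms.most_common(k)
```

Reading the same counts the other way, from anchor text to its most frequent concept, gives a simple baseline for mapping a mention in a sentence onto a Wikipedia concept.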

Page 16: Common Crawl: An Open Repository of Web Data

Mapping French websites related to Open Data

Page 17: Common Crawl: An Open Repository of Web Data

Other Use Examples

• Apache Giraph Testing
• MapLight
• TinEye
• Factual
• Sentiment Analysis Projects

Page 18: Common Crawl: An Open Repository of Web Data

In Development

• N-gram and Link Graph Extracts
• Pig Reader
• More Frequent Full Crawls
• Focused Subset Crawls at High Frequency
• Open Educational Resources

Page 19: Common Crawl: An Open Repository of Web Data

What Does The Data World Mean to Society?

Lisa Green

1 October 2012

Thank You

Lisa Green

[email protected]

@commoncrawl

@boudicca

London HUG