
Page 1: Evolving the Web into a Global Dataspace – Advances and Applications

Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 1

Prof. Dr. Christian Bizer

Evolving the Web into a Global Dataspace

- Advances and Applications -

18th International Conference on Business Information System (BIS 2015)

Page 2: Evolving the Web into a Global Dataspace – Advances and Applications


Hello

Professor Christian Bizer

University of Mannheim

Research Topics

Web Technologies

Web Data Profiling

Web Data Integration

Web Mining

Page 3: Evolving the Web into a Global Dataspace – Advances and Applications


Data and Web Science Group @ University of Mannheim

6 Professors

• Heiner Stuckenschmidt

• Rainer Gemulla

• Christian Bizer

• Simone Ponzetto

• Heiko Paulheim

• Johanna Völker

25 researchers and PhD students

http://dws.informatik.uni-mannheim.de/

1. Research methods for integrating and mining large amounts of heterogeneous information from the Web.

2. Empirically analyze the content and structure of the Web.

Page 4: Evolving the Web into a Global Dataspace – Advances and Applications


Querying the Classic Web

[Figure labels: DB, HTML]

Page 5: Evolving the Web into a Global Dataspace – Advances and Applications


Long Standing Goal

Query the Web like a single,

global database

Page 6: Evolving the Web into a Global Dataspace – Advances and Applications


2001 Article: The Semantic Web

Envisions three things to happen:

1. people publish data in structured form in addition to HTML pages on the Web

2. common vocabularies / ontologies are used to represent data

3. people implement cool applications that do smart things with the available data

Tim Berners-Lee, James Hendler and Ora Lassila: The Semantic Web. Scientific American, May 2001.

Page 7: Evolving the Web into a Global Dataspace – Advances and Applications


14 Years Later

There are 1.5 million publications about the Semantic Web on Google Scholar, but

1. Do people publish structured data on the Web?

2. Do people agree on common vocabularies / ontologies?

3. What are the cool applications that do smart things with the data?

Page 8: Evolving the Web into a Global Dataspace – Advances and Applications


Outline

1. Semantic Annotations in HTML Pages

2. Linked Data

3. Knowledge Graphs

4. Conclusions

Page 9: Evolving the Web into a Global Dataspace – Advances and Applications


1. Semantic Annotations in HTML Pages

Simple idea: Help machines to understand Web content by marking up data in HTML pages.

<!-- The outer item needs itemscope, and its div must be closed. -->
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Parkring 12a</span>
    <span itemprop="addressLocality">Vienna</span>
    <span itemprop="addressCountry">Austria</span>
  </span>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4</span> stars, based on
    <span itemprop="reviewCount">250</span> reviews.
  </div>
</div>
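As a minimal sketch of how a data consumer could read such markup, the following Python example uses only the standard-library `html.parser` module. The class name `ItempropCollector` and the shortened hotel snippet are illustrative assumptions; the sketch collects leaf properties into a flat list and does not reconstruct nested items, which a production extractor would need to do.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collects (itemprop, text) pairs from Microdata markup (flat, no nesting)."""

    def __init__(self):
        super().__init__()
        self.item_types = []    # values of itemtype attributes, in document order
        self.pairs = []         # (property name, text content) tuples
        self._open_prop = None  # leaf property whose text is currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        # Only leaf properties (no nested item) carry their value as text.
        if "itemprop" in attrs and "itemscope" not in attrs:
            self._open_prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._open_prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open_prop is not None:
            self.pairs.append((self._open_prop, "".join(self._text).strip()))
            self._open_prop = None


# Shortened version of the hotel example (assumption: address block omitted).
snippet = """
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue"> 4 </span> stars, based on
    <span itemprop="reviewCount"> 250 </span> reviews.
  </div>
</div>
"""

collector = ItempropCollector()
collector.feed(snippet)
properties = dict(collector.pairs)
print(collector.item_types)
print(properties)
```

The flat dictionary is enough to show the idea: the annotations turn free text into typed property/value pairs that a program can use without guessing at the page layout.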

Page 10: Evolving the Web into a Global Dataspace – Advances and Applications


Semantic Annotation Formats

Microformats

• date back to 2003

• small set of fixed formats

RDFa

• W3C Recommendation in 2008

• can represent any type of data

Microdata

• proposed in 2009

• tries to be simpler than RDFa

Page 11: Evolving the Web into a Global Dataspace – Advances and Applications


Open Graph Protocol

allows site owners to determine how entities are displayed in Facebook

relies on RDFa for marking up data in HTML pages

available since April 2010

Page 12: Evolving the Web into a Global Dataspace – Advances and Applications


Schema.org

The major search engines have asked site owners since 2011 to annotate their data in order to enrich search results

675 Types: Event, Place, Local Business, Product, Review, Person

Encoding: Microdata, RDFa, JSON-LD
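To illustrate the JSON-LD encoding, here is a minimal sketch that builds a Schema.org description with Python's `json` module and wraps it in the `application/ld+json` script element commonly used for embedding. The hotel values are reused from the Microdata example on slide 9; the variable names are illustrative assumptions.

```python
import json

# Schema.org description of a hotel as a plain Python dictionary.
hotel = {
    "@context": "http://schema.org",
    "@type": "Hotel",
    "name": "Vienna Marriott Hotel",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "Parkring 12a",
        "addressLocality": "Vienna",
        "addressCountry": "Austria",
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": 4,
        "reviewCount": 250,
    },
}

# JSON-LD is usually embedded in a page inside a script element of this type.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(hotel, indent=2)
    + "\n</script>"
)
print(script_tag)
```

Unlike Microdata and RDFa, this encoding keeps the structured data separate from the visible HTML elements, so the page template does not have to be touched.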

Page 13: Evolving the Web into a Global Dataspace – Advances and Applications


Usage of Schema.org Data @ Google

Rich snippets within search results

Page 14: Evolving the Web into a Global Dataspace – Advances and Applications


Event Data in Google Applications

https://developers.google.com/structured-data/

Page 15: Evolving the Web into a Global Dataspace – Advances and Applications


Flight Offers in Google Search Results

Annotated webpages directly below Google Flights results

Page 16: Evolving the Web into a Global Dataspace – Advances and Applications


Rich-Snippets Get More User Attention


Source: www.looktracker.com

Potential business incentive.

Page 17: Evolving the Web into a Global Dataspace – Advances and Applications


Motivation for Semantic Annotations

Study by searchmetrics.com in 2013: tens of thousands of search keywords

Type of rich-snippet displayed by Google:

Source: http://www.searchmetrics.com/de/knowledge-base/schema/

Google displays Rich-Snippets for 40% of all queries.

Page 18: Evolving the Web into a Global Dataspace – Advances and Applications


The Common Crawl

Page 19: Evolving the Web into a Global Dataspace – Advances and Applications


The Web Data Commons Project

extracts all Microformat, Microdata, RDFa data from the Common Crawl

analyzes and provides the extracted data for download

four extraction runs so far

• 2014 CC Corpus: 2.0 billion HTML pages, 20.4 billion RDF triples

• 2013 CC Corpus: 2.2 billion HTML pages, 17.2 billion RDF triples

• 2012 CC Corpus: 3.0 billion HTML pages, 7.3 billion RDF triples

• 2009/2010 CC Corpus: 2.5 billion HTML pages, 5.1 billion RDF triples

uses 100 machines on Amazon EC2

• approx. 3000 machine hours (spot instances of type c3.xlarge): 550 Euro

http://www.webdatacommons.org/structureddata/

Page 20: Evolving the Web into a Global Dataspace – Advances and Applications


Overall Adoption 2014

620 million HTML pages out of the 2 billion pages provide semantic annotations (30%).

2.72 million pay-level-domains (PLDs) out of the 15.68 million pay-level-domains covered by the crawl provide annotations (17%).

Google, 2014*: 5 million websites provide Schema.org data.

* Guha in LDOW2014 Keynote
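The adoption shares can be re-derived from the counts on this slide. The short sketch below does the arithmetic with the rounded figures as printed; because the inputs are rounded (the crawl size is given as "2 billion"), the page share comes out at about 31%, close to the 30% quoted above.

```python
# Rounded counts as printed on the slide.
pages_with_annotations = 620e6
pages_total = 2.0e9
plds_with_annotations = 2.72e6
plds_total = 15.68e6

page_share = pages_with_annotations / pages_total
pld_share = plds_with_annotations / plds_total
print(f"pages: {page_share:.1%}, PLDs: {pld_share:.1%}")
```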

Page 21: Evolving the Web into a Global Dataspace – Advances and Applications


Number of PLDs providing Semantic Annotations

Page 22: Evolving the Web into a Global Dataspace – Advances and Applications


Most Popular Classes

RDFa

Microdata

Page 23: Evolving the Web into a Global Dataspace – Advances and Applications


Topical Focus – Microdata 2014

Top Classes

Rank  Class                   Instances 2014  PLDs 2014 (#)  PLDs 2014 (%)  PLDs 2013 (#)  PLDs 2013 (%)
1     schema:WebPage              51,757,000        148,893          18.16         69,712          15.04
2     schema:Article              54,972,000         88,700          10.82         65,930          14.22
3     schema:Blog                  3,787,000        110,663          13.50         64,709          13.96
4     schema:Product             288,083,000         89,608          10.93         56,388          12.16
5     schema:PostalAddress        48,804,000        101,086          12.33         52,446          11.31
6     dv:Breadcrumb              269,088,000         76,894           9.38         44,187           9.53
7     schema:AggregateRating      59,070,000         50,510           6.16         36,823           7.94
8     schema:Offer               236,953,000         62,849           7.66         35,635           7.69
9     schema:LocalBusiness        20,194,000         62,191           7.58         35,264           7.61
10    schema:BlogPosting          11,458,000         65,397           7.98         32,056           6.92
11    schema:Organization        101,769,000         52,733           6.43         24,255           5.23
12    schema:Person              115,376,000         47,936           5.85         21,107           4.55
13    schema:ImageObject          35,356,000         25,573           3.12         16,084           3.47
14    dv:Product                  12,411,000         16,003           1.95         13,844           2.99
15    schema:Review               42,561,000         20,124           2.45         13,137           2.83
16    dv:Review-aggregate          3,964,000         14,094           1.72         13,075           2.82
17    dv:Organization              3,155,000         10,649           1.30          9,582           2.07
18    dv:Offer                     7,170,000         11,640           1.42          9,298           2.01
19    dv:Address                   2,138,000          9,674           1.18          8,866           1.91
20    dv:Rating                    1,732,000          9,367           1.14          8,360           1.80

Topics:

• CMS and blog metadata

• products and offers

• ratings and reviews

• business listings

• address data

• ...and a massive long tail

schema: = Schema.org

dv: = Google Rich Snippet Vocabulary (deprecated)

Page 24: Evolving the Web into a Global Dataspace – Advances and Applications


Adoption by E-Commerce Websites

Distribution by Top-Level Domain

TLD      # PLDs
com      38,344
co.uk     3,605
net       1,813
de        1,333
pl        1,273
com.br    1,194
ru        1,165
com.au    1,062
nl        1,002

Alexa Top-15 Shopping Sites (checked for schema:Product)

Amazon.com, Ebay.com, NetFlix.com, Amazon.co.uk, Walmart.com, etsy.com, Ikea.com, Bestbuy.com, Homedepot.com, Target.com, Groupon.com, Newegg.com, Lowes.com, Macys.com, Nordstrom.com

Adoption by Top-15: 60 %

Page 25: Evolving the Web into a Global Dataspace – Advances and Applications


Properties used to Describe Products

Top 15 Properties                     PLDs (#)  PLDs (%)
schema:Product/name                     78,292      87 %
schema:Product/image                    59,445      66 %
schema:Product/description              58,228      65 %
schema:Product/offers                   57,633      64 %
schema:Offer/price                      54,290      61 %
schema:Offer/availability               36,789      41 %
schema:Offer/priceCurrency              30,610      34 %
schema:Product/url                      23,723      26 %
schema:Product/aggregateRating          21,166      24 %
schema:AggregateRating/ratingValue      20,513      23 %
schema:AggregateRating/reviewCount      14,930      17 %
schema:Product/manufacturer             10,150      11 %
schema:Product/brand                     9,739      11 %
schema:Product/productID                 9,221      10 %
schema:Product/sku                       7,955       9 %

Page 26: Evolving the Web into a Global Dataspace – Advances and Applications


Adoption by Travel Websites

Top 15 Travel Websites (checked for schema:Hotel and for any class)

Booking.com (uses DataVoc), TripAdvisor, Expedia, Agoda, Hotels.com, Kayak, Priceline, Travelocity, Orbitz, ChoiceHotels, HolidayCheck, ChoiceHotels, InterContinental Hotels Group, Marriott International, Global Hyatt Corp.

Adoption: 73 %

Page 27: Evolving the Web into a Global Dataspace – Advances and Applications


Properties used to Describe Hotels

Top 10 Properties                       PLDs (#)  PLDs (%)
schema:Hotel/name                          4,173   88.35 %
schema:Hotel/address                       3,311   70.10 %
schema:Hotel/telephone                     2,488   52.68 %
schema:PostalAddress/streetAddress         2,362   50.01 %
schema:PostalAddress/addressLocality       2,231   47.24 %
schema:Hotel/url                           2,102   44.51 %
schema:PostalAddress/postalCode            2,096   44.38 %
schema:AggregateRating/ratingValue         1,952   41.33 %
schema:Hotel/aggregateRating               1,866   39.51 %
schema:AggregateRating/bestRating          1,697   35.93 %

Page 28: Evolving the Web into a Global Dataspace – Advances and Applications


Adoption by Job Websites

Distribution by Top-Level Domain

TLD     # PLDs
jobs       908
com        828
org        263
co.uk      194
net         40
nl          38
ca          33
de          32

Top-10 Employment Sites (checked for schema:JobPosting)

Indeed.com, Monster.com, Careerbuilder.com, Snagajob.com, Jobsdb.com, Jobsearch.about.com, Jobs.net, Internships.com, Jobs.aol.com, Quintcareers.com

Adoption by Top-10: 70 %

Page 29: Evolving the Web into a Global Dataspace – Advances and Applications


Properties used to Describe Job Postings

Top 10 Properties                  PLDs (#)  PLDs (%)
JobPosting/title                      2,588   91.16 %
JobPosting/hiringOrganization         1,412   49.74 %
JobPosting/description                1,192   41.99 %
JobPosting/jobLocation                1,062   37.41 %
Organization/name                       862   30.36 %
JobPosting/datePosted                   793   27.93 %
Place/address                           471   16.59 %
JobPosting/baseSalary                   227    8.00 %
JobPosting/industry                     209    7.36 %
JobPosting/educationRequirements        145    5.11 %

Page 30: Evolving the Web into a Global Dataspace – Advances and Applications


Class / Property Distribution

Only a small set of classes / properties is used.

Strong focus on Schema.org and Facebook vocabularies.

schema.org: 675 classes, 965 properties

Page 31: Evolving the Web into a Global Dataspace – Advances and Applications


Opportunity 1: Search Engine Optimization

Get richer visibility in search results and potentially more clicks.

Page 32: Evolving the Web into a Global Dataspace – Advances and Applications


Opportunity 2: Change Push to Pull Communication

Current situation:

• Information providers need to push data into multiple channels

• multiple search engines

• multiple domain-specific portals

Web approach:

• You maintain a website

• All interested parties crawl your data

Page 33: Evolving the Web into a Global Dataspace – Advances and Applications


Opportunity 3: Applications beyond Rich-Snippets

E-Commerce

• Rich source of product data, offers, and reviews

• Opportunity to build global product catalogs

• Opportunity to mine product and rating data on a global scale

Tourism

• Additional data for tourism applications: nearby local businesses, nearby landmarks, nearby hospitals, nearby events

• Search engines as new competitors put pressure on large booking portals?

Recruitment

• Increased market transparency

• Search engines as new competitors put pressure on job portals that charge per posting?

High up-to-dateness of data

• as original data providers know about changes first

Page 34: Evolving the Web into a Global Dataspace – Advances and Applications


Main Challenge: Data Integration and Cleansing

The schema is standardized, but

1. entity names differ

2. the schema is rather shallow and a rather low number of properties is used

3. data quality differs as the data is created by experts and rookies

Page 35: Evolving the Web into a Global Dataspace – Advances and Applications


Looking Deeper into the E-Commerce Data

Property                       PLDs (#)  PLDs (%)
schema:Product/name              78,292      87 %
schema:Product/description       58,228      65 %
schema:Product/manufacturer      10,150      11 %
schema:Product/brand              9,739      11 %
schema:Product/productID          9,221      10 %

1. The structure of the data is rather shallow

• Product features are encoded in titles and descriptions

• Example product name: “Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB”

• Example product description: “Faster Flash Storage with 64 GB Solid State Drive and USB 3.0 …”

• Product IDs are provided by only 10% of the websites

• Categorization information is provided by only 2% of the websites

Page 36: Evolving the Web into a Global Dataspace – Advances and Applications


Categorization of Product Offers

We analyzed 1.9 million product offers from 9200 shops.

We trained a bag-of-words classifier for 9 product categories on product descriptions from Amazon.

Source: Petar Petrovski, Volha Bryl, Christian Bizer: Integrating Product Data from Websites offering Microdata Markup. In: 4th Workshop on Data Extraction and Object Search (DEOS2014) @ WWW2014
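The slide does not say which learning algorithm was used, so the following is only a sketch of one common bag-of-words approach: a multinomial naive Bayes classifier with add-one smoothing, written in plain Python. The class name `BagOfWordsNB`, the two category labels, and the toy training texts are all invented for illustration; the study itself used nine categories and Amazon descriptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split on whitespace (the bag-of-words view)."""
    return text.lower().split()


class BagOfWordsNB:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter(labels)        # category -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch.
train_texts = [
    "laptop with intel core i5 and 64 gb flash storage",
    "notebook 11 inch display usb 3.0 solid state drive",
    "garden dining table made of teak wood",
    "outdoor furniture set with table and four chairs",
]
train_labels = ["Computers", "Computers", "Garden", "Garden"]

clf = BagOfWordsNB().fit(train_texts, train_labels)
prediction = clf.predict("apple macbook air 11 inch intel core i5 64 gb")
print(prediction)
```

Training on a large, already categorized source and applying the model to uncategorized offers is what makes this setup attractive when only a small share of shops publish category information themselves.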

Page 37: Evolving the Web into a Global Dataspace – Advances and Applications


Identity Resolution for Electronic Products

We trained feature extractors for product descriptions on offers for electronic products from Amazon.

We used the Silk framework for identity resolution.

Precision = 85%
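The general idea of feature-based identity resolution can be sketched as follows. This is not the Silk framework and not the trained extractors from the study: hand-written regular expressions stand in for the learned feature extractors, and the matching rule (at least two shared features, all of which must agree) is an assumption made for this example. The second and third product titles are invented.

```python
import re


def extract_features(title):
    """Pull a few attribute values out of a free-text product title."""
    t = title.lower()
    features = {}
    m = re.search(r"(\d+(?:\.\d+)?)\s*ghz", t)
    if m:
        features["cpu_ghz"] = float(m.group(1))
    m = re.search(r"(\d+)\s*gb", t)
    if m:
        features["storage_gb"] = int(m.group(1))
    m = re.search(r"(\d+(?:\.\d+)?)\s*-?\s*in(?:ch)?\b", t)
    if m:
        features["display_in"] = float(m.group(1))
    return features


def same_product(title_a, title_b):
    """Declare a match when at least two features are shared and all agree."""
    fa, fb = extract_features(title_a), extract_features(title_b)
    shared = fa.keys() & fb.keys()
    return len(shared) >= 2 and all(fa[k] == fb[k] for k in shared)


def precision(predicted_pairs, true_pairs):
    """Share of predicted matches that are correct."""
    predicted_pairs, true_pairs = set(predicted_pairs), set(true_pairs)
    return len(predicted_pairs & true_pairs) / len(predicted_pairs)


a = "Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB"   # from slide 35
b = "MacBook Air 11 inch 64GB SSD 1.6 GHz i5"                 # invented variant
c = "Apple MacBook Air 13-in, Intel Core i5 1.60GHz, 128 GB"  # invented non-match
print(extract_features(a))
print(same_product(a, b), same_product(a, c))
```

Precision, as reported on the slide, is the fraction of predicted matches that are correct; the `precision` helper computes it from a set of predicted pairs and a gold standard of true pairs.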

Page 38: Evolving the Web into a Global Dataspace – Advances and Applications


Starting Points for Further Improvements

Identity Resolution

• Exploit product identifiers to learn better product recognizers

• 10% of the websites (9,221 PLDs) use s:Product/productID

• 1% of the websites (935 PLDs) use s:Product/gtin13

Categorization of Products

• Exploit categorization information provided by a subset of the websites

• 1.5% of the websites (1,497 PLDs) use s:Offer/category

• 0.5% of the websites (460 PLDs) use s:WebPage/breadcrumb

• Challenge: Integration of ~ 2,000 product taxonomies

Home > Shop > Outdoor & Garden > Barbecues & Outdoor Living > Garden Furniture > Tables > Dining Tables

Philadelphia Eagles > Philadelphia Eagles Mens > Philadelphia Eagles Mens Jerseys > over $60
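A first step towards using breadcrumbs as categorization information is to turn them into category paths. The sketch below splits a breadcrumb string on ">" and drops navigation levels that carry no category meaning; the set `GENERIC_LEVELS` is an assumption made for this example, and aligning the resulting paths across roughly 2,000 site-specific taxonomies remains the open challenge named above.

```python
# Assumption: site-navigation levels without category meaning.
GENERIC_LEVELS = {"home", "shop"}


def parse_breadcrumb(breadcrumb):
    """Split a breadcrumb string into category levels, dropping generic ones."""
    levels = [part.strip() for part in breadcrumb.split(">")]
    return [lvl for lvl in levels if lvl and lvl.lower() not in GENERIC_LEVELS]


path = parse_breadcrumb(
    "Home > Shop > Outdoor & Garden > Barbecues & Outdoor Living > "
    "Garden Furniture > Tables > Dining Tables"
)
print(path)
```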

Page 39: Evolving the Web into a Global Dataspace – Advances and Applications


Conclusion: Semantic Annotations in HTML Pages

1. Widespread adoption of semantic annotations

• motivated by major search engines

2. Strong ontology agreement driven by data consumers

• Schema.org, Open Graph Protocol

3. Main application: Rich-snippets

4. Endless data pool for

• Commercial applications

• product and travel data integration and mining

• up-to-date listings of local businesses

• job search engines that increase market transparency

• Research

• large-scale data integration and mining

• information extraction (using annotations as distant supervision*)

* Foley, et al.: Learning to Extract Local Events from the Web. SIGIR 2015

Page 40: Evolving the Web into a Global Dataspace – Advances and Applications


Download and Play with the Data

http://www.webdatacommons.org/structureddata/

Only the tip of the iceberg, as each website is only partly crawled.

Page 41: Evolving the Web into a Global Dataspace – Advances and Applications


2. Linked Data

Set of best practices for publishing structured data on the Web in the form of a single global data graph

• by using RDF to publish structured data directly on the Web

• by setting links between data items within different data sources.

[Figure: data sources A, B, C, D, and E, each publishing RDF and connected by RDF links]

Page 42: Evolving the Web into a Global Dataspace – Advances and Applications


Links as Integration Hints

publishing Identity Links on the Web

<http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4>
    owl:sameAs
    <http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .

publishing Vocabulary Links on the Web

<http://xmlns.com/foaf/0.1/Person>
    owl:equivalentClass
    <http://dbpedia.org/ontology/Person> .
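One way a consumer can use identity links as integration hints is to group all URIs connected by `owl:sameAs` into clusters that denote the same real-world entity. The sketch below does this with a small union-find structure over plain Python tuples; it is an illustration rather than a prescribed method, and the third URI under `example.org` is hypothetical, added only to show that the grouping is transitive.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

triples = [
    ("http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4",
     SAME_AS,
     "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"),
    # Hypothetical third URI, added only to show transitive merging.
    ("http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer",
     SAME_AS,
     "http://example.org/people/bizer"),
]

parent = {}


def find(uri):
    """Return the representative URI of the cluster containing uri."""
    parent.setdefault(uri, uri)
    while parent[uri] != uri:
        parent[uri] = parent[parent[uri]]  # path halving keeps trees shallow
        uri = parent[uri]
    return uri


def union(a, b):
    """Merge the clusters of a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


for subject, predicate, obj in triples:
    if predicate == SAME_AS:
        union(subject, obj)

# Group every known URI under its cluster representative.
clusters = {}
for uri in list(parent):
    clusters.setdefault(find(uri), set()).add(uri)
print(clusters)
```

Once the clusters are known, descriptions of the same entity from different sources can be merged under one representative URI.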

Page 43: Evolving the Web into a Global Dataspace – Advances and Applications


Effort Distribution between Publisher and Consumer

Publishers or third parties provide identity/vocabulary links

Consumer mines missing identity/vocabulary links

Effort Distribution

Page 44: Evolving the Web into a Global Dataspace – Advances and Applications


LOD Datasets on the Web: April 2014

Growth, not counting the new category Social Networking: 94 %

Source: Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. In: 13th International Semantic Web Conference (ISWC2014).

Page 45: Evolving the Web into a Global Dataspace – Advances and Applications


Uptake in the Government Domain

Various efforts by public sector institutions world-wide

Forerunners

• UK government

• US government

Types of data published

• statistical data

• environmental data

• budget and election data

Goals

• Make data available to the public and other government agencies

• Ease data integration by using standards, providing unique identifiers and by setting links

Page 46: Evolving the Web into a Global Dataspace – Advances and Applications


Uptake in the Libraries Community

Institutions publishing Linked Data

• Library of Congress (subject headings)

• German National Library (PND dataset and subject headings)

• Swedish National Library (Libris - catalog)

• Hungarian National Library (OPAC and digital library)

• Europeana Digital Library (4 million artifacts)

• Springer (metadata about conference proceedings)

Goals:

1. Interconnect resources between repositories (by topic, by location, by historical period, by ...)

2. Integrate library catalogs on global scale

Page 47: Evolving the Web into a Global Dataspace – Advances and Applications


Uptake in the Life Science Domain

Goals:

1. Connect life science datasets in order to support

• biological knowledge discovery

• drug discovery

2. Reuse results of previous integration efforts

Page 48: Evolving the Web into a Global Dataspace – Advances and Applications


Uptake in the Linguistic Research Community

http://linguistic-lod.org/llod-cloud

http://www.lider-project.eu/

Page 49: Evolving the Web into a Global Dataspace – Advances and Applications


Ontological Agreement

Strong agreement on some vocabularies

Proprietary vocabularies are used in addition to common ones, as data is often very specific

Widely-Used Vocabularies

Proprietary Vocabularies

Page 50: Evolving the Web into a Global Dataspace – Advances and Applications


RDF Links

Some datasets put a lot of effort into linking

Many datasets only link to a small number of other datasets or do not set RDF links at all

Datasets with Top In-Degrees

Out-Degrees per Category

Page 51: Evolving the Web into a Global Dataspace – Advances and Applications


RDF Links in the LOD Cloud: August 2014

Page 52: Evolving the Web into a Global Dataspace – Advances and Applications


Page 53: Evolving the Web into a Global Dataspace – Advances and Applications


Page 54: Evolving the Web into a Global Dataspace – Advances and Applications


Linked Data as Background Knowledge for Data Mining

Which factors correlate with unemployment in France?

Page 55: Evolving the Web into a Global Dataspace – Advances and Applications


Unemployment Table with Additional Attributes

Page 56: Evolving the Web into a Global Dataspace – Advances and Applications


RapidMiner Linked Open Data Extension

Allows you to

1. link local table to LOD data sources

2. extend local table with additional attributes

3. mine extended tables using all RapidMiner features

Page 57: Evolving the Web into a Global Dataspace – Advances and Applications


Finding Correlations

Use additional attributes to find interesting correlations

Example correlations for unemployment in France:

• African islands, islands in the Indian Ocean, outermost regions of the EU (positive)

• Population growth (positive)

• Energy consumption (negative)

• Hospital beds/inhabitants (negative)

• Fast food restaurants (positive)

• Police stations (positive)

Source: Petar Ristoski, Christian Bizer, and Heiko Paulheim: Mining the Web of Linked Data with RapidMiner. Semantic Web Challenge, Winner of the Open Track, 2014.
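The link, extend, and mine workflow can be sketched independently of RapidMiner in a few lines of Python. Everything concrete here is an assumption made for illustration: the region names, the `example.org` URIs, the attribute `hospital_beds_per_1000`, and all numeric values are invented, and a plain Pearson correlation stands in for the mining step.

```python
import math

# Local table: region -> unemployment rate (all numbers invented for illustration).
unemployment = {"region_a": 8.0, "region_b": 10.0, "region_c": 12.0, "region_d": 14.0}

# Step 1: link local rows to LOD URIs (hypothetical example.org URIs).
links = {name: f"http://example.org/resource/{name}" for name in unemployment}

# Attributes that a LOD source might provide for those URIs (invented values).
lod_attributes = {
    "http://example.org/resource/region_a": {"hospital_beds_per_1000": 7.0},
    "http://example.org/resource/region_b": {"hospital_beds_per_1000": 6.5},
    "http://example.org/resource/region_c": {"hospital_beds_per_1000": 5.0},
    "http://example.org/resource/region_d": {"hospital_beds_per_1000": 4.5},
}

# Step 2: extend the local table with the linked attributes.
table = [
    {"region": name, "unemployment": rate, **lod_attributes[links[name]]}
    for name, rate in unemployment.items()
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Step 3: look for correlations between the target and the new attribute.
r = pearson(
    [row["unemployment"] for row in table],
    [row["hospital_beds_per_1000"] for row in table],
)
print(round(r, 2))
```

With the invented values the coefficient is strongly negative, which mirrors the direction reported for hospital beds per inhabitant; with real data, each added attribute would be tested in the same way.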

Page 58: Evolving the Web into a Global Dataspace – Advances and Applications


Commercial Applications: Content Management at BBC

Interconnect content management systems of different TV and radio stations.

Similar efforts to connect content repositories at Elsevier and Springer.

Source: http://www.w3.org/2001/sw/sweo/public/UseCases/BBC

Page 59: Evolving the Web into a Global Dataspace – Advances and Applications


Commercial Applications: Application Integration at IBM

IBM Rational uses Linked Data technologies to connect data from different

• software development tools

• software lifecycle tools

Goals:

1. Make data independent of concrete tool (IBM or third party)

2. Allow services (reporting, discovery) to access data from all tools

3. Distributed data space as an alternative to central repository or integration hub / bus

Source: http://www.w3.org/2001/sw/sweo/public/UseCases/IBM

Page 60: Evolving the Web into a Global Dataspace – Advances and Applications


Conclusion: Linked Data vs. HTML-embedded Data

Linked Data                                              Microdata, Microformats, RDFa
~ 1000 sources                                           millions of sources
covers wider range of specific topics                    focused on search engines and Facebook
more complex data structures                             very simple and shallow data structures
partial ontology agreement                               strong ontology agreement
data integration eased by RDF links                      data integration often requires NLP techniques
various application prototypes, some industrial uptake   strong application pull by search engines

Page 61: Evolving the Web into a Global Dataspace – Advances and Applications


3. Knowledge Graphs

Google Knowledge Graph

• development started 2012, builds on Freebase

• 570 million objects described by over 18 billion facts (2012)

• 1500 classes, 35,000 properties

Microsoft Satori Knowledge Base

• revealed to the public in mid-2013

Yahoo Knowledge Graph

• revealed to the public in early 2014

Knowledge Graphs employ RDF-style graph data models

Large cross-domain knowledge bases which aim to cover all “relevant” entities in the world.

Page 62: Evolving the Web into a Global Dataspace – Advances and Applications


Data Sources used to Build Knowledge Graphs

1. Wikipedia

• infoboxes, category system, information extraction from text

2. Open license sources

• e.g. CIA World Factbook, MusicBrainz, …

3. Commercial third-party data

• e.g. IMDB, company listings, …

4. schema.org annotations in web pages

• e.g. contact information for companies

• e.g. logos of companies

Lots of effort is spent on data integration and manual data curation

Page 63: Evolving the Web into a Global Dataspace – Advances and Applications


Application of the Google Knowledge Graph

Enrich search results with knowledge cards and lists

Goal: Fulfil information need without having users navigate to other websites

Page 64: Evolving the Web into a Global Dataspace – Advances and Applications


Application of the Microsoft Knowledge Graph

Page 65: Evolving the Web into a Global Dataspace – Advances and Applications


Applications of the Google Knowledge Graph

1. Answer fact queries: “birthdate michael douglas”

2. Compare things: “compare eiffel tower vs empire state building”

Page 66: Evolving the Web into a Global Dataspace – Advances and Applications


Google Now Smart Cards

Direct answers are especially important in the mobile context

Google Now displays direct answers for 19.45% of the queries (Source: Stone Temple Consulting, 2015)

Medical facts are reviewed by an average of 11.1 doctors (Source: Google)

Page 67: Evolving the Web into a Global Dataspace – Advances and Applications


New SEO Topic: How to influence Knowledge Graphs?

Source: http://searchengineland.com/leveraging-wikidata-gain-google-knowledge-graph-result-219706

Page 68: Evolving the Web into a Global Dataspace – Advances and Applications


Behind-the-Scenes Applications

Google

• uses its knowledge graph to identify entities in web pages (Entity Linking)

• Hummingbird ranking algorithm (deployed in 2013) uses knowledge graph as background knowledge for ranking search results.

Yahoo

• uses its knowledge graph to “support applications across the company:

• Web Search, Content Understanding

• Recommendation, Personalization, Advertisement”*

Data Integration

• becomes matching data sources against knowledge graphs as intermediate schemata.

Various tasks become easier if you know all entities in the world.

*Source: Nicolas Torzec, Yahoo 2014

Page 69: Evolving the Web into a Global Dataspace – Advances and Applications


Public Knowledge Graphs

Page 70: Evolving the Web into a Global Dataspace – Advances and Applications


The DBpedia Knowledge Base - Version 2014

Describes 4.58 million things, out of which

4.22  million are classified in a consistent ontology

using 685 classes and 2679 different properties• 1,445,000 persons

• 735,000 places

• 241,000 organizations

• 123,000 music albums

Altogether 3 billion pieces of information (RDF triples)

• 580 million were extracted from the English edition of Wikipedia

• 29,000,000 links to external web pages

• 50,000,000 external links into other RDF datasets

DBpedia Internationalization

• provides data from 125 Wikipedia language editions for download

• provides cleaned infobox data for 28 popular languages
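
To make "pieces of information (RDF triples)" and "classified in a consistent ontology" concrete, here is a minimal Python sketch. The triples are hand-written examples in DBpedia's prefix style (`dbr:`, `dbo:`), not actual DBpedia output, and the helper names are assumptions made for this illustration.

```python
from collections import Counter

# A handful of illustrative (subject, predicate, object) triples.
TRIPLES = [
    ("dbr:Tim_Berners-Lee", "rdf:type", "dbo:Person"),
    ("dbr:Tim_Berners-Lee", "dbo:birthPlace", "dbr:London"),
    ("dbr:London", "rdf:type", "dbo:Place"),
    ("dbr:Mannheim", "rdf:type", "dbo:Place"),
    ("dbr:World_Wide_Web", "rdfs:label", "World Wide Web"),
]


def class_counts(triples):
    """Count typed things per ontology class (compare the per-class figures above)."""
    return Counter(obj for _, pred, obj in triples if pred == "rdf:type")


def untyped_subjects(triples):
    """Things that are described but not classified in the ontology."""
    subjects = {subj for subj, _, _ in triples}
    typed = {subj for subj, pred, _ in triples if pred == "rdf:type"}
    return subjects - typed


COUNTS = class_counts(TRIPLES)
UNTYPED = untyped_subjects(TRIPLES)
```

The gap between "described" and "classified" things (4.58 million versus 4.22 million on the slide) corresponds to the set returned by `untyped_subjects` in this toy setting.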

Page 71: Evolving the Web into a Global Dataspace – Advances and Applications

DBpedia @ BIS2015

1. Thursday, 10:00

The Past, Present & Future of DBpedia

Keynote by Dimitris Kontokostas

2. Thursday, 10:45

4th DBpedia Community Meeting

Room 2

Page 72: Evolving the Web into a Global Dataspace – Advances and Applications

Google Knowledge Vault

Research project to build a knowledge base using facts extracted from 1 billion web pages

1. Web text (TXT): Entity linking, relationship extraction

2. HTML trees (DOM): Wrapper induction

3. HTML tables (TBL): Relational tables

4. Semantic Annotations (ANO): schema.org, OGP

Employs a probabilistic model for data fusion

Results: 1.6 billion facts

• 271 million with confidence >90%

• 90 million not in Freebase

Source: Luna Dong, Evgeniy Gabrilovich, et al.: Knowledge Vault: A Web-scale approach to probabilistic knowledge fusion. In SIGKDD, 2014.
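
The intuition behind fusing the output of several extractors (TXT, DOM, TBL, ANO) into one confidence per fact can be sketched as follows. Note that this uses a simple noisy-OR combination as a stand-in; it is not the fusion model described in the Knowledge Vault paper, and the extractor scores below are made up for the example.

```python
def noisy_or(confidences):
    """Probability that at least one extractor is right,
    assuming (unrealistically) that the extractors err independently."""
    p_all_wrong = 1.0
    for confidence in confidences:
        p_all_wrong *= 1.0 - confidence
    return 1.0 - p_all_wrong


def fuse(extractions, threshold=0.9):
    """Group per-extractor confidences by fact and keep facts above the threshold."""
    by_fact = {}
    for fact, _extractor, confidence in extractions:
        by_fact.setdefault(fact, []).append(confidence)
    fused = {fact: noisy_or(scores) for fact, scores in by_fact.items()}
    return {fact: p for fact, p in fused.items() if p > threshold}


# Hypothetical extractions: (fact, extractor, confidence).
FACT_A = ("ex:Alice", "ex:bornIn", "ex:Mannheim")
FACT_B = ("ex:Bob", "ex:worksFor", "ex:Acme")
EXTRACTIONS = [
    (FACT_A, "TXT", 0.7),
    (FACT_A, "DOM", 0.8),
    (FACT_B, "TBL", 0.6),
]
CONFIDENT = fuse(EXTRACTIONS)
```

Here `FACT_A` is supported by two extractors and passes the 90% threshold, while `FACT_B`, seen by a single extractor, does not.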

Page 73: Evolving the Web into a Global Dataspace – Advances and Applications

Data Sources for Public Research in this Space

1. Common Crawl

• ~ 2 billion HTML pages

• updated every couple of months

2. WebDataCommons HTML Tables Corpus

• 147 million relational web tables

• selected out of the 11 billion tables contained in the Common Crawl

• http://webdatacommons.org/webtables/

3. WebDataCommons Microdata and RDFa Corpora

• 20.4 billion RDF triples

• http://www.webdatacommons.org/structureddata/

4. Billion Triples Challenge Dataset 2014

• 4 billion RDF triples crawled from Linked Data sources

• http://km.aifb.kit.edu/projects/btc-2014/
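
As a starting point for working with such triple corpora, the following Python sketch reads lines in the N-Quads serialization (subject, predicate, object, plus the graph or page the statement came from) and counts how often each predicate occurs. It assumes the data is available as N-Quads and handles only a simplified subset of the syntax; the sample lines and helper names are made up for this example, and a full RDF parser should be used for real work.

```python
import re
from collections import Counter

# One term: an IRI (<...>), a blank node (_:x), or a literal
# with an optional language tag or datatype.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(r'^\s*' + r'\s+'.join([TERM] * 4) + r'\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


def predicate_counts(lines):
    """Simple data profiling: how often is each predicate used?"""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad:
            counts[quad[1]] += 1
    return counts


SAMPLE = [
    '<http://example.org/p1> <http://schema.org/name> "Widget"@en <http://example.org/page.html> .',
    '_:b0 <http://schema.org/price> "9.99" <http://example.org/page.html> .',
    'not a quad',
]
PROFILE = predicate_counts(SAMPLE)
```

Lines that do not match the simplified pattern are skipped rather than raising an error, which is a deliberate choice for noisy web data.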

Page 74: Evolving the Web into a Global Dataspace – Advances and Applications

Conclusion: 2001 Article - The Semantic Web

Envisions three things to happen:

1. people publish data in structured form in addition to HTML pages on the Web

2. common vocabularies / ontologies are used to represent data

3. people implement cool applications that do smart things with the available data

Tim Berners-Lee, James Hendler and Ora Lassila: The Semantic Web. Scientific American, May 2001.

Page 75: Evolving the Web into a Global Dataspace – Advances and Applications

4. Conclusions

1. Publication of Structured Data

• there is more data available than most people in research and industry think

• schema.org annotations in particular are currently gaining traction

• exciting test-bed for research on data profiling and data integration techniques

2. Ontological Agreement

• exists due to application-pull (Google, Facebook)

• but data source-specific attributes are also important (e.g. in the life science or government statistics domains)

3. Applications

• the big players are moving (Rich Snippets, Knowledge Graphs)

• there is a lot of further application potential in the available data

• experimentation in industry, but many efforts are still in the prototype stage

Page 76: Evolving the Web into a Global Dataspace – Advances and Applications

Thanks

References

• Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC2014).

• Max Schmachtenberg, Christian Bizer and Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. 13th International Semantic Web Conference (ISWC2014).

• Petar Petrovski, Volha Bryl and Christian Bizer: Integrating Product Data from Websites offering Microdata Markup. 4th Workshop on Data Extraction and Object Search (DEOS2014).

Detailed statistics on RDFa, Microdata and Microformats adoption

• http://www.webdatacommons.org/structureddata/

Detailed statistics on Linked Data adoption

• http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/