
Post on 17-Jan-2016








WeST – Web Science & Technologies
University of Koblenz ▪ Landau, Germany

Building and Using Knowledge Bases

Steffen Staab

Saqib Mir – European Bioinformatics Institute
Ermelinda d'Oro, Massimo Ruffolo – Univ. Calabria, Italy

& WeST Team

Steffen Staab staab@uni-koblenz.de

WeST – Web Science & Technologies

Slide 2

Semantic Web

Web Retrieval

Social Web

Multimedia Web

Software Web

Institute WeST – Web Science & Technologies



Slide 3

PhD thesis trauma 17 years ago

„Nach dem Auspacken der LPS 105 präsentiert sich dem Betrachter ein stabiles Laufwerk, das genauso geringe Außenmaße besitzt wie die Maxtor.“

“Having unpacked the LPS 105, the onlooker is presented with a sturdy disk drive whose outer dimensions are just as small as those of the Maxtor.”


Slide 4


General motivation is not information extraction, but solving tasks!


Slide 5

General objective: Extracting to LOD


Crucial to know: Ontologies nowadays reflect this structure

Ontologies are
• Modular (vs one to rule them all)
• Distributed (vs defined in one place)
• Connected (vs isolated templates)
• Extensible (vs claimed to be finished)
• Lightweight (vs computationally intractable)
• Popular ones are used more often (vs people disagreeing)

Ontologies – LEGO style


Slide 6

Most famous applications

Steve Macbeth (Microsoft), in a discussion of Schema.org: “about 7% of pages we crawl have mark-up” (http://www.w3.org/2012/06/06-schema-minutes.html)

LOD Cloud

Google Knowledge Graph

Bing gets its own knowledge graph



Slide 7


Example ontology-based application 1:


Slide 8

General objective: Analysing LOD



Slide 9


Family's analysis of Koblenz LOD + Open Street Map data


Slide 10


Entrepreneur's analysis of Koblenz LOD + Open Street Map data

1st Prize, German Linked Open Gov Data Competition 2012


Slide 11


Example ontology-based application 2:


Slide 12

Making Web 2.0 More Accessible

Links Location


Knowledge Tags

low- to midlevel features


GeoNames

[Schenk et al; JoWS 2009]


Slide 13

Choosing between Koblenz – and Koblenz

Video at: http://vimeo.com/2057249


Slide 14

Contextual Information


Slide 15

Tag-based refinement


Slide 16

A tag view of “Koblenz” & “Castle”


Slide 17

Semantic Identity – Festung Ehrenbreitstein


Slide 18

Persons – Celebrities, FOAFers & Flickr Users

Billion Triples Challenge, 1st Prize 2008

[Schenk et al; JoWS 2009]


Slide 19


Now on to information extraction:


Slide 20

Challenges & Opportunities for IE

Not all web pages are created equal


Slide 21

Challenges & Opportunities for IE

Some challenges are the same, e.g. finding type instances


Slide 22

Challenges & Opportunities for IE

Some challenges are the same, e.g. finding relation instances


Slide 23

Challenges & Opportunities for IE

Some contain concepts and their descriptions, some don't

No types here, few relation types


Slide 24

Challenges & Opportunities for IE

Knowing that they are instances and of which type

Textual indication

Positional indication


Slide 25

Challenges & Opportunities for IE

To some extent, positional and layout indications work across languages and sites


Slide 26

Challenges & Opportunities for IE


We should not only think about Web pages, but about Web sites


Slide 27

Challenges & Opportunities for IE


We should not only think about Web pages, but about Web sites


Slide 28

Comparing related work to our objectives

Related work objectives
• IE on Web pages
• Acquiring instances and relationship instances
• IE based on linear text

Our objectives
• IE on Web sites
• Acquiring items
• Classifying items into
  • Instances
  • Concepts
  • Relation instances
  • Relationships
• IE also based on spatial position

There is overlap, and of course there are exceptions in related work


Slide 29


The Bio-Case

The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation

[Oro et al; VLDB 2010]


Slide 30

Presentation-oriented documents


Slide 31

Presentation-oriented documents

• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)


Slide 32

Related Work

Web query languages
• XPath 1.0 and XQuery 1.0
  • Established
  • Too difficult to use for scraping from intricate DOM structures

Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]

Web wrapper induction
• Exploiting visual interface [Gottlob et al.] [Sahuguet et al.]
• Generate XPath location paths of DOM nodes
• Can benefit from using Spatial XPath


Slide 33


The Bio-Case

The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation


Slide 34



Representing Spatial Relations between DOM Nodes


Slide 35

Idea: Use Spatial Relations among DOM Nodes


Slide 36

Spatial DOM (SDOM)


Slide 37

SXPath System Architecture


Slide 38

Querying for Relations Among Nodes

Rectangular Cardinal Relations (RCR)

Topological Relations

r1 E:NE r2

Spatial models allow for expressing disjunctive relations among regions
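
A relation such as r1 E:NE r2 can be illustrated with a small Python sketch. It assumes axis-aligned rectangles in screen coordinates (y grows downward) and reads the relation as the set of tiles, in the 3x3 grid induced by the reference rectangle r2, that r1 overlaps. The names Rect and cardinal_tiles are hypothetical and are not taken from the SXPath implementation.

```python
from typing import NamedTuple, Set


class Rect(NamedTuple):
    left: float
    top: float      # assumption: screen coordinates, y grows downward
    right: float
    bottom: float


def _bands(lo, hi, ref_lo, ref_hi, names):
    """Return the bands (before / within / after the reference interval)
    that the interval [lo, hi] overlaps with positive length."""
    before, within, after = names
    hit = []
    if lo < ref_lo:
        hit.append(before)
    if lo < ref_hi and hi > ref_lo:
        hit.append(within)
    if hi > ref_hi:
        hit.append(after)
    return hit


def cardinal_tiles(r1: Rect, r2: Rect) -> Set[str]:
    """Tiles around reference rectangle r2 that r1 overlaps (B = r2's own box)."""
    cols = _bands(r1.left, r1.right, r2.left, r2.right, ("W", "", "E"))
    rows = _bands(r1.top, r1.bottom, r2.top, r2.bottom, ("N", "", "S"))
    # A rectangle is a product of two intervals, so its tiles are rows x cols.
    return {(row + col) or "B" for row in rows for col in cols}


r2 = Rect(left=0, top=0, right=10, bottom=10)
r1 = Rect(left=12, top=-5, right=20, bottom=5)   # right of r2, partly above it
print(":".join(sorted(cardinal_tiles(r1, r2))))  # prints E:NE
```

Returning a set rather than a fixed string keeps the sketch independent of any particular tile ordering convention.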


Slide 39

XPath Example


Slide 40

SXPath Example


Slide 41


Slide 42

From XPath 1.0 towards Spatial Querying with SXPath

SXPath features
• Adopts intuitive path notation: axis::nodetest [pred]*
• Adds to XPath
  • spatial axes
  • spatial position functions
• Natural semantics for spatial querying


Slide 43

SXPath System Architecture


Slide 44

Complexity Results

Formal model defined in the paper [Oro et al; VLDB 2010]


Slide 45


The Bio-Case

The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation


Slide 46

SXPath System


Slide 47

Summative User Study


Slide 48

Summative User Study


Slide 49

Summative User Study


Slide 50


The Bio-Case
• Motivation
• The (Biochemical) Deep Web
• Contributions
  • Page-level wrapper induction
  • Site-wide wrapper generation
  • Error Correction by Mutual Reinforcement
• Conclusions and Future Directions

The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  • Spatial Data Model
  • Syntax & Semantics
  • Complexity
• Implementation
• Evaluation


Slide 51

>1000 Life Science DBs, number growing quickly


Slide 52

Biochemical Web Sites: Observations - 1

Labeled Data

Total   Labeled   Unlabeled   Unlabeled (Redundant)
754     719       19          16

Table 1: Data fields across 20 Biochemical Web sites

Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)


Slide 53

Biochemical Web Sites: Observations - 2

Dynamic Web Pages


Slide 54

Biochemical Web Sites: Observations - 3

Rich Site Structure


Slide 55

Biochemical Web Sites: Observations - 4

Semantics is often only in the report, not in the underlying relational database

Web Services
• Survey: 11 of 100 databases [1] provide APIs
• Incomplete coverage
• Varying granularity
• No semantics in the service description

[1] Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). Complete survey was available at http://sabiork.villa-bosch.de/index.html/survey.html


Slide 56

Biochemical Web Sites: Extraction Tasks

Induce Wrapper


[Mir et al; DILS 2009] [Mir et al; ESWC 2010]


Slide 57


Unsupervised Page-Level Wrapper Induction

Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery)

(Acquiring the Schema/Ontology)

Automatic Error Detection and Correction by Mutual Reinforcement


Slide 58

Page-Level Wrapper Induction – 1

D1 = {C00221, beta-D-Glucose, …, R01520,,…}
O1 = {Entry, Name,…, Reaction, R00026, Enzyme,…,}

D2 = {C00185, Cellobiose,…, R00306,,… }
O2 = {Entry, Name,…, Reaction, R00026, Enzyme,…,}
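
One plausible reading of the D and O sets, sketched in Python: text entries that differ between two sample pages of the same class are treated as data (D), and entries shared by both pages are treated as other content such as labels (O). This reading, the function name split_data_and_other and the two toy entry lists (built from the values shown on the slide) are assumptions for illustration only.

```python
def split_data_and_other(page_a, page_b):
    """Split the text entries of two sample pages into data (D) and other (O).

    Assumption: an entry that occurs on both pages is template content
    (a label or decoration); an entry unique to one page is data.
    """
    shared = set(page_a) & set(page_b)

    def split(page):
        data = [entry for entry in page if entry not in shared]
        other = [entry for entry in page if entry in shared]
        return data, other

    return split(page_a), split(page_b)


# Toy pages, using only values that appear on the slide.
page1 = ["Entry", "C00221", "Name", "beta-D-Glucose",
         "Reaction", "R00026", "R01520", "Enzyme"]
page2 = ["Entry", "C00185", "Name", "Cellobiose",
         "Reaction", "R00026", "R00306", "Enzyme"]

(d1, o1), (d2, o2) = split_data_and_other(page1, page2)
print(d1)  # ['C00221', 'beta-D-Glucose', 'R01520']
print(o1)  # ['Entry', 'Name', 'Reaction', 'R00026', 'Enzyme']
```

Note that R00026 lands in O because it occurs on both pages, although it is a reaction identifier; this kind of misclassification is what a later reclassification step has to repair.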



Slide 59

Page-Level Wrapper Induction - 2

Reclassify – Growing Data Regions


Slide 60

Page-Level Wrapper Induction - 3

D1´ = {C00221, beta-D-Glucose, …, R01520,, …}
O1´ = {Entry, Name,…, Reaction, R00026, Enzyme,…,}

D2´ = {C00185, Cellobiose,…, R00306,, … }
O2´ = {Entry, Name,…, Reaction, R00026, Enzyme,…,}


Slide 61

Page-Level Wrapper Induction - 4

Selecting Labels for Data

html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] (“” )

html/…./table[1]/tr[6]/th[1]/…/code[1]/ (“Reaction”)

html/…./table[1]/tr[8]/th[1]/…/code[1]/ (“Enzyme”)
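
A minimal sketch of how a label could be picked for a data node, assuming the heuristic is to choose the candidate label whose DOM path shares the longest prefix with the data path. Both the heuristic and the fully written-out paths below are hypothetical; the elided paths on the slide are not reconstructed here.

```python
def common_prefix_length(path_a, path_b):
    """Number of leading path steps that two '/'-separated paths share."""
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    count = 0
    for step_a, step_b in zip(steps_a, steps_b):
        if step_a != step_b:
            break
        count += 1
    return count


def select_label(data_path, label_paths):
    """Pick the label whose path shares the longest prefix with data_path."""
    return max(label_paths,
               key=lambda label: common_prefix_length(data_path, label_paths[label]))


# Hypothetical, fully spelled-out paths for illustration only.
data_path = "html/body/table[1]/tr[8]/td[1]/code[1]/a[1]"
label_paths = {
    "Reaction": "html/body/table[1]/tr[6]/th[1]/code[1]",
    "Enzyme": "html/body/table[1]/tr[8]/th[1]/code[1]",
}
print(select_label(data_path, label_paths))  # Enzyme
```

With these toy paths, "Enzyme" wins because it sits in the same table row (tr[8]) as the data node, while "Reaction" diverges one step earlier.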


Slide 62

Page-Level Wrapper Induction - 5

Anchor the Path

Enzyme - html/table[1]/tr[8]/th[1]/code[1]/


//*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()>=2]/text()

Pivot, Generalize, Relative
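
The anchoring idea can be sketched with Python's standard library: locate the label text (the pivot), climb to the enclosing table row, then follow a relative path to the data nodes. The toy page, the values enzyme-A and enzyme-B, and the helper extract_by_label are invented for illustration; the sketch also generalizes by dropping the index on the final step instead of using a position() predicate.

```python
import xml.etree.ElementTree as ET

# Toy page (invented): one label cell and one data cell per row.
PAGE = """
<html><body><table>
  <tr><th><code>Entry</code></th><td><code><a>C00221</a></code></td></tr>
  <tr><th><code>Enzyme</code></th>
      <td><code><a>enzyme-A</a><a>enzyme-B</a></code></td></tr>
</table></body></html>
"""


def extract_by_label(root, label, relative_path):
    """Find nodes whose text equals `label`, climb to the enclosing <tr>,
    and return the text of all nodes reached via `relative_path`."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    values = []
    for node in root.iter():
        if (node.text or "").strip() != label:
            continue
        row = node                      # pivot: the node carrying the label
        while row is not None and row.tag != "tr":
            row = parent_of.get(row)    # relative part: walk up to the row
        if row is not None:
            values.extend(a.text for a in row.findall(relative_path))
    return values


root = ET.fromstring(PAGE.strip())
print(extract_by_label(root, "Enzyme", "td/code/a"))  # ['enzyme-A', 'enzyme-B']
```

Because the path starts from the label text rather than from the document root, the wrapper keeps working when rows above the label are added or omitted.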


Slide 63

Selected Sources

KEGG, ChEBI, MSDChem
• Basic qualitative data
• Popular
• Overlapping/complementary data


Slide 64

Wrapper Induction - Evaluation


Source          L    D     S    TP    FN    FP   P (%)   R (%)
KEGG Compound   10   762   3    411   351   46   89.9    53.9
                           15   759   3     0    100     99.6
KEGG Reaction   10   205   3    173   32    0    100     84.4
                           15   205   0     0    100     100
                22   831   3    595   236   41   93.5    71.6
                           15   829   2     0    100     99.7
                30   600   3    600   0     20   96.7    100
                           15   600   0     20   96.7    100

KEGG Compound: http://www.genome.jp/kegg/compound/
KEGG Reaction: http://www.genome.jp/kegg/reaction/

Average (based on final wrappers for each source): P = 99.1, R = 99.8

~9 samples – ~99% P, ~98% R

Table 2: Page-level wrapper induction results, 20 test pages (L = Labels, D = Data entries, S = Training pages)
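
The precision and recall figures can be checked with a short calculation. The sketch assumes that, in the first KEGG Compound row, 411 is the number of correctly extracted data entries, 351 the number of missed entries and 46 the number of wrongly extracted ones; this reading is an assumption, chosen because 411 + 351 equals the 762 data entries and because it reproduces the reported percentages.

```python
def precision_recall(correct, missed, wrong):
    """Precision and recall in percent, rounded to one decimal place."""
    precision = 100 * correct / (correct + wrong)
    recall = 100 * correct / (correct + missed)
    return round(precision, 1), round(recall, 1)


# KEGG Compound with 3 training pages (assumed column reading, see above).
print(precision_recall(correct=411, missed=351, wrong=46))  # (89.9, 53.9)
```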


Slide 65

Site-Wide Wrapper Induction: Observations

Not all pages contain data (e.g. legal disclaimers, contact pages, navigational menus)
• An efficient approach should ignore these pages
• We don't need to learn the entire site structure


Slide 66

Site-Wide Wrapper Induction: Observations - 2

Classified Link-Collections point to data-intensive pages of the same class.


Slide 67

Site-Wide Wrapper Induction: Observations - 3

Pages belonging to the same class describe the same concepts
• Some concepts are sometimes omitted
• Ordering is always the same


Slide 68

Site-Wide Wrapper Induction

1. Start with C0

2. Follow all classified link-collections

3. Generate wrappers for each set of target pages

4. Determine if new class is formed

5. Add navigation step

6. Repeat 2 – 5 for each new class formed in 4

If C0 != Ci (i>0): S = S + Ci;

Navigation Steps
W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
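
The six steps can be read as a worklist traversal over page classes. The Python sketch below is one such reading under stated assumptions: follow_links and induce_class are hypothetical callbacks standing in for crawling a class's classified link-collections and for wrapper generation plus class comparison, and the toy site mirrors the three navigation steps shown above.

```python
from collections import deque


def discover_site_structure(start_class, follow_links, induce_class):
    """Return (S, W): discovered classes and navigation steps.

    follow_links(cls) -> {link_collection_id: [target pages]}   (hypothetical)
    induce_class(pages) -> class induced from the target pages  (hypothetical)
    """
    classes = [start_class]              # S, starting with C0 (step 1)
    navigation = []                      # W
    queue = deque([start_class])
    while queue:                         # step 6: repeat for each new class
        source = queue.popleft()
        for link_id, pages in follow_links(source).items():   # step 2
            target = induce_class(pages)                      # step 3
            if target not in classes:                         # step 4
                classes.append(target)
                queue.append(target)
            navigation.append((source, link_id, target))      # step 5
    return classes, navigation


# Toy site: C0 links to pages of its own class and of two new classes.
site = {
    "C0": {"L1": ["p-c0"], "L2": ["p-c2"], "L3": ["p-c3"]},
    "C2": {},
    "C3": {},
}
page_class = {"p-c0": "C0", "p-c2": "C2", "p-c3": "C3"}

S, W = discover_site_structure(
    "C0",
    follow_links=lambda cls: site[cls],
    induce_class=lambda pages: page_class[pages[0]],
)
print(S)  # ['C0', 'C2', 'C3']
print(W)  # [('C0', 'L1', 'C0'), ('C0', 'L2', 'C2'), ('C0', 'L3', 'C3')]
```

Only link-collections reachable from data-bearing classes are followed, so pages without data never enter the worklist.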






Slide 69

Site-Wide Wrapper Induction – Evaluation


Source    C    C´   D      TP     FN     FP    P (%)   R (%)
MSDChem   1    1    N/A    N/A    N/A    N/A   N/A     N/A
ChEBI     3    1    1711   1195   516    0     100     69.8
KEGG      10   7    6223   5044   1179   188   97      81.1
Average                                        98.5    75.5

Table 3: Site-wide wrapper induction results, 20 test pages for each class (C = Classes, C´ = Classes discovered, D = Data entries)


Slide 70

Error Detection and Correction: Mutual Reinforcement

Observation: Certain data reappear on more than one class of pages


Slide 71

Error Detection and Correction: Mutual Reinforcement

Reinforcement if reappearing data is correctly classified as Data

Otherwise it points to misclassification

Label-Data Mismatch
• Correction: Introduce more samples

Label-Label Mismatch
• Cannot be detected
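
Detecting a label-data mismatch can be sketched as a cross-check between page classes: a value classified as data on one class of pages but as a label on another is flagged. The data structure, the class names compound and reaction, and the function find_label_data_mismatches are assumptions for illustration, not the authors' implementation.

```python
def find_label_data_mismatches(classified):
    """Flag values that are data on one page class but a label on another.

    classified: {page_class: {"data": set_of_values, "labels": set_of_values}}
    Returns sorted (value, class_where_data, class_where_label) triples.
    """
    mismatches = []
    for class_a, result_a in classified.items():
        for class_b, result_b in classified.items():
            if class_a == class_b:
                continue
            for value in result_a["data"] & result_b["labels"]:
                mismatches.append((value, class_a, class_b))
    return sorted(mismatches)


# Toy input: R00026 was wrongly treated as a label on compound pages.
classified = {
    "compound": {"data": {"C00221"}, "labels": {"Entry", "R00026"}},
    "reaction": {"data": {"R00026"}, "labels": {"Entry"}},
}
print(find_label_data_mismatches(classified))
# [('R00026', 'reaction', 'compound')]
```

A value that is misclassified as a label on every class of pages produces no such conflict, which matches the remark that label-label mismatches cannot be detected.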


Slide 72

Where to go next?

Reverse engineering production
1. LOD: emitting RDF & RDFS
2. Navigation model: what belongs to what
3. Interaction model: (not treated at all by us so far)
4. Layout model: spatial positioning

Capture this generative model using machine learning
• Relational learning
  • Markov logic programmes?
  • …?


Slide 73


Ermelinda Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.

Saqib Mir, Steffen Staab, Isabel Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.

Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.

WeST – Web Science & Technologies
University of Koblenz ▪ Landau, Germany

Thank you for your attention!
