information extraction for building knowledge basis

WeST – Web Science & Technologies University of Koblenz ▪ Landau, Germany Information Extraction for Building Knowledge Bases Steffen Staab Saqib Mir – European Bioinformatics Institute Ermelinda d‘Oro, Massimo Ruffolo – Univ. Calabria, Italy

Upload: steffen-staab

Post on 26-Jan-2015

DESCRIPTION

Presentation given at PUC Rio on March 8, 2012

TRANSCRIPT

Page 1: Information extraction for building knowledge basis

WeST – Web Science & Technologies
University of Koblenz ▪ Landau, Germany

Information Extraction for

Building Knowledge Bases

Steffen Staab

Saqib Mir – European Bioinformatics Institute
Ermelinda d’Oro, Massimo Ruffolo – Univ. Calabria, Italy

Page 2: Information extraction for building knowledge basis

Steffen Staab [email protected]

WeST – Web Science & Technologies

Slide 2

A FEW SLIDES WHERE WEST COMES FROM

Page 3: Information extraction for building knowledge basis

Page 4: Information extraction for building knowledge basis

Semantic Web

Web Retrieval

Social Web

Multimedia Web

Software Web

Institut WeST – Web Science & Technologies

GESIS

Page 5: Information extraction for building knowledge basis

We (co-)organize conferences and schools

Page 6: Information extraction for building knowledge basis

We build applications and develop methods…

BTC 1. Prize 2011

1. Prize German Linked Open Gov Data Competition 2012

BTC 1. Prize 2008

German KM 1. Prize 2011

Page 7: Information extraction for building knowledge basis

We teach Web Science

Master in Web Science@Koblenz
• Free tuition
• Start Fall 2012
• English

2012 Web Science Summer School
Lorentz Center, Leiden, The Netherlands, 9-13 July 2012

Master in eGov@Koblenz
• Free tuition
• Start Fall 2012
• English

Page 8: Information extraction for building knowledge basis

We are active in joint projects

• EU Integrated Project ROBUST (10 partners): Risk and Opportunity management of huge-scale BUSiness communiTy cooperation
• EU Live+Gov – Reality Sensing, Mining and Augmentation for Mobile Citizen–Government Dialogue
• EU WeGov – where eGovernment meets the eSociety
• EU IP SocialSensor – Sensing User Generated Input for Improved Media Discovery and Experience
• EU Net2 – a network for networked knowledge
• EU MOST – Marrying Ontologies and Software Technologies

Page 9: Information extraction for building knowledge basis

INFORMATION EXTRACTION FOR BUILDING KNOWLEDGE BASES

Steffen Staab,

Saqib Mir, European Bioinformatics Institute
Ermelinda d’Oro, Massimo Ruffolo, Univ. Calabria, Italy

Page 10: Information extraction for building knowledge basis

GENERAL MOTIVATION

Page 11: Information extraction for building knowledge basis

General objective: Extracting to LOD

Figure labels: hasLivedIn, useAsExample

Page 12: Information extraction for building knowledge basis

General objective: Analysing LOD

Figure labels: hasLivedIn, useAsExample

Page 13: Information extraction for building knowledge basis

http://lisa.west.uni-koblenz.de/lisa-demo/

Family‘s analysis of Munich LOD + Open Street Map data

Page 14: Information extraction for building knowledge basis

http://lisa.west.uni-koblenz.de/lisa-demo/

Entrepreneur‘s analysis of Munich LOD + Open Street Map data

Page 15: Information extraction for building knowledge basis

OBSERVATIONS ON INFORMATION EXTRACTION

Page 16: Information extraction for building knowledge basis

Challenges & Opportunities for IE

Not all web pages are created equal

Page 17: Information extraction for building knowledge basis

Challenges & Opportunities for IE

Some challenges are the same, e.g. finding type instances

Page 18: Information extraction for building knowledge basis

Challenges & Opportunities for IE

Some challenges are the same, e.g. finding relation instances

Page 19: Information extraction for building knowledge basis

Challenges & Opportunities for IE

Some contain concepts and their descriptions, some don't

No types here, few relation types

Page 20: Information extraction for building knowledge basis

Challenges & Opportunities for IE

Knowing that they are instances and of which type

Textual indication

Positional indication

Page 21: Information extraction for building knowledge basis

Challenges & Opportunities for IE

To some extent, positional and layout indications work across languages and sites

Page 22: Information extraction for building knowledge basis

Challenges & Opportunities for IE

owl:sameAs

We should not only think about Web pages, but about Web sites

Page 23: Information extraction for building knowledge basis

Challenges & Opportunities for IE

owl:sameAs

We should not only think about Web pages, but about Web sites

Page 24: Information extraction for building knowledge basis

Comparing related work to our objectives

Related work objectives
• IE on Web pages
• Acquiring instances and relationship instances
• IE based on linear text

Our objectives
• IE on Web sites
• Acquiring items
• Classifying items in
  • Instances
  • Concepts
  • Relation instances
  • Relationships
• IE also based on spatial position

There is overlap, and there are a few exceptions in related work

Page 25: Information extraction for building knowledge basis

Outline

The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  • Spatial Data Model
  • Syntax & Semantics
  • Complexity
• Implementation
• Evaluation

Page 26: Information extraction for building knowledge basis

Presentation-oriented documents

Music band profile

band photo

band name

Acquiring a music band profile: a music band photo that has its descriptive information to its east

Page 27: Information extraction for building knowledge basis

Presentation-oriented documents

Page 28: Information extraction for building knowledge basis

Presentation-oriented documents

• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)

Page 29: Information extraction for building knowledge basis

Related Work

Web query languages
• XPath 1.0 and XQuery 1.0
  • Established
  • Too difficult to use for scraping from intricate DOM structures

Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]

Web wrapper induction exploiting visual interface [Gottlob et al.] [Sahuguet et al.]
• generate XPath location paths of DOM nodes
• can benefit from using Spatial XPath

Page 30: Information extraction for building knowledge basis

Outline

The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  • Spatial Data Model
  • Syntax & Semantics
  • Complexity
• Implementation
• Evaluation

Page 31: Information extraction for building knowledge basis

Idea: Use Spatial Relations among DOM Nodes

Page 32: Information extraction for building knowledge basis

Idea: Use Spatial Relations among DOM Nodes

Page 33: Information extraction for building knowledge basis

Idea: Use Spatial Relations among DOM Nodes

Page 34: Information extraction for building knowledge basis

Spatial DOM (SDOM)

Page 35: Information extraction for building knowledge basis

Spatial Relations Among Nodes

Rectangular Cardinal Relations (RCR)

Topological Relations

r1 E:NE r2

Spatial models allow for expressing disjunctive relations among regions
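
As a rough illustration of the notation `r1 E:NE r2`, the sketch below extends the sides of r2's bounding rectangle into a 3×3 grid of tiles and reports which tiles r1 overlaps. The `Rect` type, the function name, the tile name `B` for the central tile, and the screen-coordinate convention (y grows downward) are assumptions made for this example; the slides do not define them.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in screen coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float


def _overlap(lo1, hi1, lo2, hi2):
    """True if the open intervals (lo1, hi1) and (lo2, hi2) intersect."""
    return max(lo1, lo2) < min(hi1, hi2)


def cardinal_tiles(r1, r2):
    """Return the set of tiles around r2 that r1 overlaps."""
    inf = float("inf")
    # Column and row bands induced by the sides of the reference rectangle r2.
    columns = [("W", -inf, r2.left), ("", r2.left, r2.right), ("E", r2.right, inf)]
    rows = [("N", -inf, r2.top), ("", r2.top, r2.bottom), ("S", r2.bottom, inf)]
    tiles = set()
    for row_name, row_lo, row_hi in rows:
        for col_name, col_lo, col_hi in columns:
            if (_overlap(r1.top, r1.bottom, row_lo, row_hi)
                    and _overlap(r1.left, r1.right, col_lo, col_hi)):
                # The central tile has no direction name; call it "B" here.
                tiles.add((row_name + col_name) or "B")
    return tiles


r2 = Rect(left=0, top=10, right=10, bottom=20)
r1 = Rect(left=12, top=5, right=20, bottom=15)   # right of r2, reaching above it
print(sorted(cardinal_tiles(r1, r2)))            # ['E', 'NE']
```

Returning a set keeps the example neutral about the order in which tiles are written in a relation such as E:NE.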

Page 36: Information extraction for building knowledge basis

XPath Example

Page 37: Information extraction for building knowledge basis

SXPath Example

Page 38: Information extraction for building knowledge basis

Page 39: Information extraction for building knowledge basis

From XPath 1.0 towards Spatial Querying with SXPath

SXPath features
• adopts intuitive path notation: axis::nodetest [pred]*
• adds to XPath
  • spatial axes
  • spatial position functions
• natural semantics for spatial querying
• maintains polynomial time combined complexity

Page 40: Information extraction for building knowledge basis

Why SXPath?

an XPath for Information extraction

web applications

familiarity

Simplicity

resilient wrappers

human oriented

efficiency

Page 41: Information extraction for building knowledge basis

Outline

The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  • Spatial Data Model
  • Syntax & Semantics
  • Complexity
• Implementation
• Evaluation

Page 42: Information extraction for building knowledge basis

Spatial DOM (SDOM)

Page 43: Information extraction for building knowledge basis

Spatial Navigation Axes

Page 44: Information extraction for building knowledge basis

Spatial Navigation Axes

Page 45: Information extraction for building knowledge basis

Syntax of SXPath

Page 46: Information extraction for building knowledge basis

Complexity Results

Page 47: Information extraction for building knowledge basis

Outline

The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  • Spatial Data Model
  • Syntax & Semantics
  • Complexity
• Implementation
• Evaluation

Page 48: Information extraction for building knowledge basis

SXPath System Architecture

Page 49: Information extraction for building knowledge basis

SXPath System

Page 50: Information extraction for building knowledge basis

Results of Experiments

Page 51: Information extraction for building knowledge basis

Formative User Study

Page 52: Information extraction for building knowledge basis

Summative User Study

Page 53: Information extraction for building knowledge basis

Summative User Study

Page 54: Information extraction for building knowledge basis

Summative User Study

Page 55: Information extraction for building knowledge basis

Existing Extensions to PDF

Page 56: Information extraction for building knowledge basis

Table

Page Header

Page Footer

Text Area and Paragraphs

Item List

Page Number

Page 57: Information extraction for building knowledge basis

Outline

The Bio-Case
• Motivation
• The (Biochemical) Deep Web
• Contributions
  • Page-level wrapper induction
  • Site-wide wrapper generation
  • Error Correction by Mutual Reinforcement
• Conclusions and Future Directions

The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  • Spatial Data Model
  • Syntax & Semantics
  • Complexity
• Implementation
• Evaluation

Page 58: Information extraction for building knowledge basis

>1000 Life Science DBs, number growing quickly

Page 59: Information extraction for building knowledge basis

Biochemical Web Sites: Observations - 1

Labeled Data

Total   Labeled   Unlabeled   Unlabeled (Redundant)
754     719       19          16

Table 1: Data fields across 20 Biochemical Web sites

Full survey: http://sabio.villa-bosch.de/labelsurvey.html

Page 60: Information extraction for building knowledge basis

Biochemical Web Sites: Observations - 2

Dynamic Web Pages

Page 61: Information extraction for building knowledge basis

Biochemical Web Sites: Observations - 3

Rich Site Structure

Page 62: Information extraction for building knowledge basis

Biochemical Web Sites: Observations - 4

Web Services Survey: 11 of 100 databases¹ provide APIs
• Incomplete coverage
• Varying granularity
• No semantics in the service description

¹ Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). Complete survey available at http://sabiork.villa-bosch.de/index.html/survey.html

Page 63: Information extraction for building knowledge basis

Biochemical Web Sites: Implications

Induce Wrapper

Induce Wrapper

Induce Wrapper

Page 64: Information extraction for building knowledge basis

Contributions

Unsupervised Page-Level Wrapper Induction

Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery)

Automatic Error Detection and Correction by Mutual Reinforcement

Page 65: Information extraction for building knowledge basis

Page-Level Wrapper Induction – 1

D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, …}
O1 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}

D2 = {C00185, Cellobiose, …, R00306, 1.1.99.18, …}
O2 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}

//*[text()]
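
One plausible reading of the sets above is that the text nodes selected by `//*[text()]` on two sample pages are compared: entries that differ between the pages go into D1 and D2, entries that occur on both go into O. The slide does not spell this out, so the sketch below is an assumption, and the two text-node lists are shortened, illustrative stand-ins for the real pages.

```python
def split_text_nodes(page1, page2):
    """Partition the text nodes of two sample pages.

    Returns (d1, d2, o): entries unique to page1, entries unique to page2,
    and entries shared by both pages (in page1 order).
    """
    shared = set(page1) & set(page2)
    d1 = [t for t in page1 if t not in shared]
    d2 = [t for t in page2 if t not in shared]
    o = [t for t in page1 if t in shared]
    return d1, d2, o


# Illustrative text nodes, abbreviated from the values shown on the slide.
page1 = ["Entry", "C00221", "Name", "beta-D-Glucose",
         "Reaction", "R00026", "R01520", "Enzyme", "1.1.1.47", "3.2.1.21"]
page2 = ["Entry", "C00185", "Name", "Cellobiose",
         "Reaction", "R00026", "R00306", "Enzyme", "1.1.99.18", "3.2.1.21"]

d1, d2, o = split_text_nodes(page1, page2)
print(d1)  # ['C00221', 'beta-D-Glucose', 'R01520', '1.1.1.47']
print(o)   # ['Entry', 'Name', 'Reaction', 'R00026', 'Enzyme', '3.2.1.21']
```

Under this reading, a data value that happens to appear on both pages, such as 3.2.1.21, lands in O at first; the later slides show it moved into D1´ and D2´.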

Page 66: Information extraction for building knowledge basis

Page-Level Wrapper Induction - 2

Reclassify – Growing Data Regions

Page 67: Information extraction for building knowledge basis

Page-Level Wrapper Induction - 3

D1´ = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21, …}
O1´ = {Entry, Name, …, Reaction, R00026, Enzyme, …}

D2´ = {C00185, Cellobiose, …, R00306, 1.1.99.18, 3.2.1.21, …}
O2´ = {Entry, Name, …, Reaction, R00026, Enzyme, …}

Page 68: Information extraction for building knowledge basis

Page-Level Wrapper Induction - 4

Selecting Labels for Data

html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] (“1.1.1.47”)

html/…./table[1]/tr[6]/th[1]/…/code[1]/ (“Reaction”)

html/…./table[1]/tr[8]/th[1]/…/code[1]/ (“Enzyme”)

Page 69: Information extraction for building knowledge basis

Page-Level Wrapper Induction - 5

Anchor the Path

Enzyme - html/table[1]/tr[8]/th[1]/code[1]/

html/table[1]/tr[8]/td[1]/code[1]/a[1]
html/table[1]/tr[8]/td[1]/code[1]/a[2]

//*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()≥2]/text()

Pivot, Generalize, Relative
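
The step from absolute paths to a label-anchored relative path can be sketched with plain string handling. This is a minimal sketch under assumptions: the function name is hypothetical, paths are treated as `/`-separated steps, and generalization is simplified to dropping a positional index where the data paths disagree, rather than emitting a `position()` predicate as on the slide.

```python
import re


def relative_path(label_path, data_paths):
    """Rewrite absolute data paths as one path relative to the label node."""
    label = label_path.strip("/").split("/")
    datas = [p.strip("/").split("/") for p in data_paths]

    # Length of the prefix shared by the label path and every data path.
    shared = 0
    while (all(shared < len(p) for p in [label] + datas)
           and all(p[shared] == label[shared] for p in datas)):
        shared += 1

    # Climb from the label node up to the deepest shared ancestor.
    ups = [".."] * (len(label) - shared)

    tails = [p[shared:] for p in datas]
    if len({len(t) for t in tails}) != 1:
        raise ValueError("data paths have different depths")

    # Descend to the data; drop the index where the data paths differ.
    down = []
    for steps in zip(*tails):
        tags = {re.sub(r"\[\d+\]$", "", s) for s in steps}
        if len(set(steps)) == 1:
            down.append(steps[0])
        elif len(tags) == 1:
            down.append(tags.pop())
        else:
            raise ValueError("data paths differ in more than an index")
    return "/".join(ups + down)


rel = relative_path(
    "html/table[1]/tr[8]/th[1]/code[1]/",
    ["html/table[1]/tr[8]/td[1]/code[1]/a[1]",
     "html/table[1]/tr[8]/td[1]/code[1]/a[2]"],
)
print(rel)  # ../../td[1]/code[1]/a
```

Appending the result to an anchor such as `//*[text()='Enzyme']` gives a wrapper that no longer depends on the absolute position of the row in the table.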

Page 70: Information extraction for building knowledge basis

Selected Sources

KEGG, ChEBI, MSDChem
• Basic qualitative data
• Popular
• Overlapping/complementary data

Page 71: Information extraction for building knowledge basis

Wrapper Induction - Evaluation

SOURCE         #L   #D   #S   TP   FN   FP   P      R
KEGG Compound  10   762  3    411  351  46   89.9   53.9
                         15   759  3    0    100    99.6
KEGG Reaction  10   205  3    173  32   0    100    84.4
                         15   205  0    0    100    100
ChEBI          22   831  3    595  236  41   93.5   71.6
                         15   829  2    0    100    99.7
MSDChem        30   600  3    600  0    20   96.7   100
                         15   600  0    20   96.7   100

Average (based on final wrappers for each source): P 99.1, R 99.8

~9 samples – ~99% P, ~98% R

Sources:
• KEGG Compound: http://www.genome.jp/kegg/compound/
• KEGG Reaction: http://www.genome.jp/kegg/reaction/
• ChEBI: http://www.ebi.ac.uk/chebi
• MSDChem: http://www.ebi.ac.uk/msd-srv/msdchem/

Table 2: Page-level wrapper induction results, 20 test pages (L=Labels, D=Data entries, S=Training pages)
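
The P and R columns of Table 2 are consistent with the usual definitions P = TP / (TP + FP) and R = TP / (TP + FN), both in percent. The slide does not state the formulas, so this is an assumption; the sketch below checks it against the two KEGG Compound rows.

```python
def precision_recall(tp, fn, fp):
    """Precision and recall in percent, rounded to one decimal place."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return round(precision, 1), round(recall, 1)


# KEGG Compound with 3 training pages: TP=411, FN=351, FP=46
print(precision_recall(411, 351, 46))  # (89.9, 53.9)

# KEGG Compound with 15 training pages: TP=759, FN=3, FP=0
print(precision_recall(759, 3, 0))     # (100.0, 99.6)
```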

Page 72: Information extraction for building knowledge basis

Site-Wide Wrapper Induction: Observations

• Not all pages contain data (e.g. legal disclaimers, contact pages, navigational menus)
• An efficient approach should ignore these pages
• We don't need to learn the entire site structure

Page 73: Information extraction for building knowledge basis

Site-Wide Wrapper Induction: Observations - 2

Classified Link-Collections point to data-intensive pages of the same class.

Page 74: Information extraction for building knowledge basis

Site-Wide Wrapper Induction: Observations - 3

• Pages belonging to the same class describe the same concepts
• Some concepts are sometimes omitted
• Ordering is always the same

Page 75: Information extraction for building knowledge basis

Site-Wide Wrapper Induction

1. Start with C0

2. Follow all classified link-collections

3. Generate wrappers for each set of target pages

4. Determine if new class is formed

5. Add navigation step

6. Repeat 2 – 5 for each new class formed in 4

Figure labels: C0; L1, L2, L3; C1, C2, C3

S = {C0}

If C0 != Ci (i>0): S = S + Ci;

Navigation Steps
W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
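
The numbered steps above can be sketched as a simple traversal. This is not the authors' implementation: the `site` dictionary, the function name, and the breadth-first order are assumptions for illustration, and the wrapper induction and class comparison of steps 3 and 4 are replaced by a lookup of an already known target class.

```python
from collections import deque


def discover_site_structure(site, start):
    """Return (S, W): discovered page classes and navigation steps.

    site maps a page class to {link_collection: target_class}; in a real
    system the target class would be decided by inducing a wrapper for the
    target pages and comparing it with the wrappers of known classes.
    """
    classes = [start]          # S, in discovery order (step 1)
    steps = []                 # W
    queue = deque([start])
    while queue:
        source = queue.popleft()
        # Step 2: follow all classified link collections of this class.
        for links, target in site.get(source, {}).items():
            # Steps 3-4 (stand-in): the target class is given by the lookup.
            steps.append((source, links, target))   # step 5
            if target not in classes:
                classes.append(target)
                queue.append(target)                # step 6
    return classes, steps


# Hypothetical site mirroring the figure: L1 leads back to the class of C0.
site = {"C0": {"L1": "C0", "L2": "C2", "L3": "C3"}}
classes, steps = discover_site_structure(site, "C0")
print(classes)  # ['C0', 'C2', 'C3']
print(steps)    # [('C0', 'L1', 'C0'), ('C0', 'L2', 'C2'), ('C0', 'L3', 'C3')]
```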

Page 76: Information extraction for building knowledge basis

Site-Wide Wrapper Induction – Evaluation

SOURCE   #C    #C´   #D    TP    FN    FP    P     R
MSDChem  1     1     N/A   N/A   N/A   N/A   N/A   N/A
ChEBI    3     1     1711  1195  516   0     100   69.8
KEGG     10    7     6223  5044  1179  188   97    81.1
Average                                      98.5  75.5

Table 3: Site-wide wrapper induction results, 20 test pages for each class (C=Classes, C´=Classes discovered, D=Data entries)

Page 77: Information extraction for building knowledge basis

Error Detection and Correction: Mutual Reinforcement

Observation: Certain data reappear on more than one class of pages

Page 78: Information extraction for building knowledge basis

Error Detection and Correction: Mutual Reinforcement

• Reinforcement if reappearing data is correctly classified as Data
• Otherwise it points to misclassification
  • Label-Data Mismatch
    • Correction: Introduce more samples
  • Label-Label Mismatch
    • Cannot be detected
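
A minimal sketch of the detection idea, assuming each page class yields a mapping from text entries to "label" or "data": an entry that is classified differently on two page classes signals a Label-Data mismatch. The function name, the page-class names, and the example classifications are hypothetical.

```python
from collections import defaultdict


def find_label_data_mismatches(classified):
    """classified: {page_class: {text: "label" or "data"}}.

    Returns the entries whose classification differs between page classes,
    together with the classification seen on each class.
    """
    seen = defaultdict(dict)
    for page_class, entries in classified.items():
        for text, kind in entries.items():
            seen[text][page_class] = kind
    return {text: by_class for text, by_class in seen.items()
            if len(set(by_class.values())) > 1}


# Hypothetical wrapper output for two page classes.
classified = {
    "compound": {"Enzyme": "label", "3.2.1.21": "label", "C00221": "data"},
    "enzyme": {"Enzyme": "label", "3.2.1.21": "data"},
}
mismatches = find_label_data_mismatches(classified)
print(mismatches)  # {'3.2.1.21': {'compound': 'label', 'enzyme': 'data'}}
```

An entry classified as a label on every page class produces no disagreement, which matches the remark that a Label-Label mismatch cannot be detected this way.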

Page 79: Information extraction for building knowledge basis

Where to go next?

Reverse engineering production

1. LOD

2. Navigation model

3. Interaction model

4. Layout model

Capture this generative model using machine learning
• Relational learning
  • Markov logic programmes?
  • …?

emitting RDF & RDFS

what belongs to what

(- not treated at all by us so far -)

spatial positioning

Page 80: Information extraction for building knowledge basis

Bibliography

Linda d’Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.

S. Mir, S. Staab, I. Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.

Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.

Page 81: Information extraction for building knowledge basis

WeST – Web Science & Technologies
University of Koblenz ▪ Landau, Germany

Thank you for your attention!