Information Extraction for Building Knowledge Bases

WeST – Web Science & Technologies
University of Koblenz ▪ Landau, Germany

Information Extraction for

Building Knowledge Bases

Steffen Staab

Saqib Mir – European Bioinformatics Institute
Ermelinda d’Oro, Massimo Ruffolo – Univ. Calabria, Italy

Slide 2

A FEW SLIDES WHERE WEST COMES FROM



Slide 3



Slide 4

Semantic Web

Web Retrieval

Social Web

Multimedia Web

Software Web

Institut WeST – Web Science & Technologies

GESIS



Slide 5

We (co-)organize conferences and schools



Slide 6

We build applications and develop methods…

BTC 1st Prize 2008

BTC 1st Prize 2011

German KM 1st Prize 2011

1st Prize, German Linked Open Gov Data Competition 2012



Slide 7

We teach Web Science

Master in Web Science @ Koblenz
• Free tuition
• Start Fall 2012
• English

2012 Web Science Summer School
Lorentz Center, Leiden, The Netherlands, 9-13 July 2012

Master in eGov @ Koblenz
• Free tuition
• Start Fall 2012
• English



Slide 8

We are active in joint projects

• EU Integrated Project ROBUST (10 partners): Risk and Opportunity management of huge-scale BUSiness communiTy cooperation
• EU Live+Gov: Reality Sensing, Mining and Augmentation for Mobile Citizen–Government Dialogue
• EU WeGov: where eGovernment meets the eSociety
• EU IP SocialSensor: Sensing User Generated Input for Improved Media Discovery and Experience
• EU Net2: a network for networked knowledge
• EU MOST: Marrying Ontologies and Software Technologies



Slide 9

INFORMATION EXTRACTION FOR BUILDING KNOWLEDGE BASES

Steffen Staab,

Saqib Mir, European Bioinformatics Institute
Ermelinda d’Oro, Massimo Ruffolo, Univ. Calabria, Italy



Slide 10

GENERAL MOTIVATION



Slide 11

General objective: Extracting to LOD

hasLivedIn

useAsExample



Slide 12

General objective: Analysing LOD

hasLivedIn

useAsExample



Slide 13

http://lisa.west.uni-koblenz.de/lisa-demo/

Family’s analysis of Munich LOD + Open Street Map data



Slide 14

http://lisa.west.uni-koblenz.de/lisa-demo/

Entrepreneur’s analysis of Munich LOD + Open Street Map data



Slide 15

OBSERVATIONS ON INFORMATION EXTRACTION



Slide 16

Challenges & Opportunities for IE

Not all web pages are created equal



Slide 17


Some challenges are the same, e.g. finding type instances



Slide 18


Some challenges are the same, e.g. finding relation instances



Slide 19


Some contain concepts and their descriptions, some don’t

No types here, few relation types



Slide 20


Knowing that they are instances and of which type

Textual indication

Positional indication



Slide 21


To some extent

positional and layout

indications work across

languages and sites



Slide 22


owl:sameAs

We should not only think about

Web pages, but about Web sites



Slide 23


owl:sameAs

We should not only think about

Web pages, but about Web sites



Slide 24

Comparing related work to our objectives

Related work objectives
• IE on Web pages
• Acquiring instances and relationship instances
• IE based on linear text

Our objectives
• IE on Web sites
• Acquiring items
• Classifying items into instances, concepts, relation instances and relationships
• IE also based on spatial position

There is overlap, and there are a few exceptions in related work



Slide 25

Outline

The Bio-Case

The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language: Spatial Data Model, Syntax & Semantics, Complexity
• Implementation
• Evaluation



Slide 26

Presentation-oriented documents

Music band profile

band photo

band name

Acquiring a music band profile: a music band photo that has its descriptive information to the east



Slide 27




Slide 28


• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)



Slide 29

Related Work

Web query languages
• XPath 1.0 and XQuery 1.0
• Established
• Too difficult to use for scraping from intricate DOM structures

Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]

Web wrapper induction exploiting visual interface [Gottlob et al.] [Sahuguet et al.]
• generate XPath location paths of DOM nodes
• can benefit from using Spatial XPath



Slide 30

Outline






Slide 31

Idea: Use Spatial Relations among DOM Nodes



Slide 32




Slide 33




Slide 34

Spatial DOM (SDOM)



Slide 35

Spatial Relations Among Nodes

Rectangular Cardinal Relations (RCR)

Topological Relations

r1 E:NE r2

Spatial models allow for expressing disjunctive relations among regions
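A minimal sketch of how a relation such as "r1 E:NE r2" could be computed from rendered bounding boxes. It assumes axis-aligned rectangles with positive area and browser-style coordinates (y grows downward, so "north" means a smaller y). The names Rect, TILES and cardinal_relation are illustrative and are not taken from the SXPath system.

```python
from typing import NamedTuple, Set


class Rect(NamedTuple):
    """Axis-aligned bounding box; y grows downward as in browser layouts (assumption)."""
    left: float
    top: float
    right: float
    bottom: float


# The reference rectangle splits the plane into a 3x3 grid of tiles.
# Row 0 is north of it, row 2 is south; column 0 is west, column 2 is east.
TILES = [["NW", "N", "NE"],
         ["W", "B", "E"],
         ["SW", "S", "SE"]]


def _bands(lo, hi, ref_lo, ref_hi):
    """Indices of the bands (before, inside, after the reference interval) that (lo, hi) overlaps."""
    bands = []
    if lo < ref_lo:
        bands.append(0)
    if min(hi, ref_hi) > max(lo, ref_lo):
        bands.append(1)
    if hi > ref_hi:
        bands.append(2)
    return bands


def cardinal_relation(primary: Rect, reference: Rect) -> Set[str]:
    """Tiles of `reference` that `primary` overlaps; {'E', 'NE'} corresponds to 'E:NE'."""
    cols = _bands(primary.left, primary.right, reference.left, reference.right)
    rows = _bands(primary.top, primary.bottom, reference.top, reference.bottom)
    return {TILES[r][c] for r in rows for c in cols}


# Hypothetical layout: a band name rendered to the right of, and slightly above, the photo.
photo = Rect(left=0, top=0, right=100, bottom=100)
name = Rect(left=120, top=-20, right=200, bottom=50)
relation = cardinal_relation(name, photo)
print(sorted(relation))
```

A multi-tile result like this is how one rectangle can stand in a disjunctive position (partly east, partly north-east) with respect to another.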



Slide 37

XPath Example



Slide 38

SXPath Example



Slide 39



Slide 40

From XPath 1.0 towards Spatial Querying with SXPath

SXPath features
• adopts intuitive path notation: axis::nodetest [pred]*
• adds to XPath: spatial axes, spatial position functions
• natural semantics for spatial querying
• maintains polynomial time combined complexity



Slide 41

Why SXPath?

an XPath for Information extraction

web applications

familiarity

Simplicity

resilient wrappers

human oriented

efficiency



Slide 42

Outline






Slide 43

Spatial DOM (SDOM)



Slide 44

Spatial Navigation Axes



Slide 45

Spatial Navigation Axes



Slide 46

Syntax of SXPath



Slide 50

Complexity Results



Slide 51

Outline






Slide 52

SXPath System Architecture



Slide 53

SXPath System



Slide 54

Results of Experiments



Slide 55

Formative User Study



Slide 56

Summative User Study



Slide 57




Slide 58




Slide 59

Existing Extensions to PDF



Slide 60

Table

Page Header

Page Footer

Text Area and Paragraphs

Item List

Page Number



Slide 61

Outline

The Bio-Case
• Motivation
• The (Biochemical) Deep Web
• Contributions: Page-level wrapper induction, Site-wide wrapper generation, Error Correction by Mutual Reinforcement
• Conclusions and Future Directions

The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language





Slide 62

>1000 Life Science DBs, number growing quickly



Slide 63

Biochemical Web Sites: Observations - 1

Labeled Data

Total   Labeled   Unlabeled   Unlabeled (Redundant)
754     719       19          16

Table 1: Data fields across 20 Biochemical Web sites

Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)



Slide 64


Dynamic Web Pages



Slide 65


Rich Site Structure



Slide 66


Web Services
• Survey: 11 of 100 databases (1) provide APIs
• Incomplete coverage
• Varying granularity
• No semantics in the service description

(1) Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). Complete survey available at http://sabiork.villa-bosch.de/index.html/survey.html



Slide 67

Biochemical Web Sites: Implications

Induce Wrapper

Induce Wrapper

Induce Wrapper



Slide 68

Contributions

Unsupervised Page-Level Wrapper Induction

Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery)

Automatic Error Detection and Correction by Mutual Reinforcement



Slide 69

Page-Level Wrapper Induction – 1

D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47,…}
O1 = {Entry, Name,…, Reaction, R00026, Enzyme,…, 3.2.1.21}

D2 = {C00185, Cellobiose,…, R00306, 1.1.99.18,… }
O2 = {Entry, Name,…, Reaction, R00026, Enzyme,…, 3.2.1.21}

//*[text()]
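The split into D and O sets can be sketched as follows, assuming that text entries found on both sample pages go to O (label candidates) and entries unique to one page go to D (data). That split rule is an assumption read off the example sets on this slide. The two pages are made-up, well-formed fragments; the standard-library parser is used in place of an XPath engine, and only the leading text of each element is considered.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample pages (not real KEGG markup).
PAGE_1 = """<table>
  <tr><th>Entry</th><td>C00221</td></tr>
  <tr><th>Name</th><td>beta-D-Glucose</td></tr>
  <tr><th>Enzyme</th><td><a>1.1.1.47</a><a>3.2.1.21</a></td></tr>
</table>"""

PAGE_2 = """<table>
  <tr><th>Entry</th><td>C00185</td></tr>
  <tr><th>Name</th><td>Cellobiose</td></tr>
  <tr><th>Enzyme</th><td><a>1.1.99.18</a><a>3.2.1.21</a></td></tr>
</table>"""


def text_entries(page_source):
    """Non-empty leading text of every element, in document order (approximates //*[text()])."""
    root = ET.fromstring(page_source)
    return [el.text.strip() for el in root.iter() if el.text and el.text.strip()]


def split_entries(page_a, page_b):
    """Return (D_a, O_a, D_b, O_b): entries unique to a page vs. entries shared by both."""
    a, b = text_entries(page_a), text_entries(page_b)
    shared = set(a) & set(b)
    d_a = [t for t in a if t not in shared]
    o_a = [t for t in a if t in shared]
    d_b = [t for t in b if t not in shared]
    o_b = [t for t in b if t in shared]
    return d_a, o_a, d_b, o_b


d1, o1, d2, o2 = split_entries(PAGE_1, PAGE_2)
print("D1 =", d1)
print("O1 =", o1)
```

As in the sets above, a value that happens to occur on both pages (here 3.2.1.21) initially lands in O and has to be reclassified later.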



Slide 70

Page-Level Wrapper Induction - 2

Reclassify – Growing Data Regions



Slide 71


D1´ = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21 …}
O1´ = {Entry, Name,…, Reaction, R00026, Enzyme,…,}

D2´ = {C00185, Cellobiose,…, R00306, 1.1.99.18, 3.2.1.21 … }
O2´ = {Entry, Name,…, Reaction, R00026, Enzyme,…,}



Slide 72


Selecting Labels for Data

html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] (“1.1.1.47”)

html/…./table[1]/tr[6]/th[1]/…/code[1]/ (“Reaction”)

html/…./table[1]/tr[8]/th[1]/…/code[1]/ (“Enzyme”)



Slide 73


Anchor the Path

Enzyme - html/table[1]/tr[8]/th[1]/code[1]/

html/table[1]/tr[8]/td[1]/code[1]/a[1]
html/table[1]/tr[8]/td[1]/code[1]/a[2]

//*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()>=2]/text()

Pivot, Relative, Generalize
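The three steps (pivot on the label, make the data path relative to it, generalize over sibling positions) can be sketched on plain path strings. This is an illustration under simplifying assumptions, not the authors' implementation: paths are treated as "/"-separated steps, label text is assumed to contain no quote characters, and generalization simply drops a positional index where the data paths disagree (the slide's expression uses a position() predicate instead). All function names are hypothetical.

```python
import re


def relative_path(label_path, data_path):
    """Path from the label node to the data node: climb to the common ancestor, then descend."""
    label_steps = label_path.strip("/").split("/")
    data_steps = data_path.strip("/").split("/")
    common = 0
    for label_step, data_step in zip(label_steps, data_steps):
        if label_step != data_step:
            break
        common += 1
    ups = [".."] * (len(label_steps) - common)
    return "/".join(ups + data_steps[common:])


def generalize(paths):
    """Merge paths that differ only in positional indices by dropping those indices."""
    split = [p.split("/") for p in paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("paths must have the same number of steps")
    merged = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            merged.append(steps[0])
            continue
        names = {re.sub(r"\[\d+\]$", "", step) for step in steps}
        if len(names) != 1:
            raise ValueError("paths differ in more than a positional index")
        merged.append(names.pop())
    return "/".join(merged)


def anchored_xpath(label_text, label_path, data_paths):
    """Pivot on the label text, then follow the generalized relative path to the data."""
    relative = generalize([relative_path(label_path, p) for p in data_paths])
    return f"//*[text()='{label_text}']/{relative}/text()"


expression = anchored_xpath(
    "Enzyme",
    "html/table[1]/tr[8]/th[1]/code[1]/",
    ["html/table[1]/tr[8]/td[1]/code[1]/a[1]",
     "html/table[1]/tr[8]/td[1]/code[1]/a[2]"],
)
print(expression)
```

Anchoring on the label text rather than on the absolute path means the wrapper still applies when a page has more or fewer rows above the labelled one.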



Slide 74

Selected Sources

KEGG, ChEBI, MSDChem Basic qualitative data Popular Overlapping/complementary data



Slide 75

Wrapper Induction - Evaluation

SOURCE          #L   #D    #S   TP    FN    FP   P      R
KEGG Compound   10   762   3    411   351   46   89.9   53.9
KEGG Compound   10   762   15   759   3     0    100    99.6
KEGG Reaction   10   205   3    173   32    0    100    84.4
KEGG Reaction   10   205   15   205   0     0    100    100
ChEBI           22   831   3    595   236   41   93.5   71.6
ChEBI           22   831   15   829   2     0    100    99.7
MSDChem         30   600   3    600   0     20   96.7   100
MSDChem         30   600   15   600   0     20   96.7   100

Average (based on final wrappers for each source): P 99.1, R 99.8

~9 samples – ~99% P, ~98% R

Sources:
• KEGG Compound: http://www.genome.jp/kegg/compound/
• KEGG Reaction: http://www.genome.jp/kegg/reaction/
• ChEBI: http://www.ebi.ac.uk/chebi
• MSDChem: http://www.ebi.ac.uk/msd-srv/msdchem/

Table 2: Page-level wrapper induction results, 20 test pages (L = Labels, D = Data entries, S = Training pages)
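The P and R columns appear to follow the usual definitions, P = TP / (TP + FP) and R = TP / (TP + FN), in percent. A short check against the first KEGG Compound row (3 training pages), with a hypothetical helper name:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall in percent; 0.0 when the denominator is empty."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# KEGG Compound with 3 training pages: TP = 411, FN = 351, FP = 46.
p, r = precision_recall(tp=411, fp=46, fn=351)
print(f"P = {p:.1f}, R = {r:.1f}")  # prints "P = 89.9, R = 53.9"
```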



Slide 76

Site-Wide Wrapper Induction: Observations

Not all pages contain data (e.g. Legal disclaimers, contact pages, navigational menus)

An efficient approach should ignore these pages

We don’t need to learn the entire site structure



Slide 77

Site-Wide Wrapper Induction: Observations - 2

Classified Link-Collections point to data-intensive pages of the same class.



Slide 78

Site-Wide Wrapper Induction: Observations - 3

• Pages belonging to the same class describe the same concepts
• Some concepts are sometimes omitted
• Ordering is always the same



Slide 79

Site-Wide Wrapper Induction

1. Start with C0

2. Follow all classified link-collections

3. Generate wrappers for each set of target pages

4. Determine if new class is formed

5. Add navigation step

6. Repeat 2 – 5 for each new class formed in 4

Figure: start class C0 with link-collections L1, L2, L3 leading to target classes C1, C2, C3

S = {C0}

If C0 != Ci (i > 0): S = S + Ci

Navigation steps: W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
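The loop above can be sketched as a breadth-first traversal over page classes. The sketch rests on strong simplifying assumptions: a page is a dictionary with a set of labels and named link-collections, the "wrapper" of a set of pages is just the union of their labels, and two classes are equal when those label sets are equal. The toy site and all names are hypothetical.

```python
from collections import deque


def induce_site_wrappers(pages, start_pages):
    """Return (classes S, navigation steps W) reachable from the start pages."""

    def wrapper_for(page_ids):
        # Stand-in for page-level wrapper induction: the union of observed labels.
        labels = set()
        for page_id in page_ids:
            labels |= pages[page_id]["labels"]
        return frozenset(labels)

    classes = [wrapper_for(start_pages)]       # step 1: S = {C0}
    navigation = []                            # W
    queue = deque([(0, list(start_pages))])
    while queue:
        source, members = queue.popleft()
        collections = {}
        for page_id in members:                # step 2: follow classified link-collections
            for name, targets in pages[page_id]["links"].items():
                collections.setdefault(name, []).extend(targets)
        for name in sorted(collections):
            target_class = wrapper_for(collections[name])   # step 3
            if not target_class:
                continue                       # pages without data are ignored
            if target_class in classes:        # step 4: known class or new class?
                target = classes.index(target_class)
            else:
                classes.append(target_class)
                target = len(classes) - 1
                queue.append((target, collections[name]))   # step 6
            navigation.append((source, name, target))       # step 5
    return classes, navigation


# Toy site: L1 leads to pages of the start class, L2 and L3 lead to new classes.
SITE = {
    "c1": {"labels": {"Entry", "Name", "Reaction"},
           "links": {"L1": ["c2"], "L2": ["r1"], "L3": ["e1"]}},
    "c2": {"labels": {"Entry", "Name", "Reaction"}, "links": {}},
    "r1": {"labels": {"Entry", "Equation"}, "links": {}},
    "e1": {"labels": {"Entry", "Class"}, "links": {}},
}

classes, navigation = induce_site_wrappers(SITE, ["c1"])
print(navigation)
```

Only newly formed classes are queued, so the traversal stops once no link-collection leads to an unseen class and the rest of the site is never visited.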



Slide 80

Site-Wide Wrapper Induction – Evaluation

SOURCE    #C   #C'   #D     TP     FN     FP    P     R
MSDChem   1    1     N/A    N/A    N/A    N/A   N/A   N/A
ChEBI     3    1     1711   1195   516    0     100   69.8
KEGG      10   7     6223   5044   1179   188   97    81.1

Average: P 98.5, R 75.5

Table 3: Site-wide wrapper induction results, 20 test pages for each class (C = Classes, C' = Classes discovered, D = Data entries)



Slide 81

Error Detection and Correction: Mutual Reinforcement

Observation: Certain data reappear on more than one class of pages



Slide 82

Error Detection and Correction: Mutual Reinforcement

Reinforcement if reappearing data is correctly classified as Data

Otherwise it points to misclassification

Label-Data Mismatch
• Correction: Introduce more samples

Label-Label Mismatch
• Cannot be detected
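A minimal sketch of the cross-check, assuming each page class comes with one set of strings classified as data and one classified as labels. A value classified as data on two classes counts as reinforced. A value that is data on one class and a label on another is reported as a Label-Data mismatch. The example values are made up for illustration, and cross_check is a hypothetical name.

```python
def cross_check(classified):
    """Return (reinforced values, [(value, class where data, class where label), ...])."""
    reinforced = set()
    mismatches = []
    names = sorted(classified)
    for a in names:
        for b in names:
            if a == b:
                continue
            # The same value classified as data on two page classes: reinforcement.
            reinforced |= classified[a]["data"] & classified[b]["data"]
            # Data on class a but a label on class b: Label-Data mismatch.
            for value in sorted(classified[a]["data"] & classified[b]["labels"]):
                mismatches.append((value, a, b))
    return reinforced, mismatches


# Illustrative classification results (not taken from the evaluation).
CLASSIFIED = {
    "compound": {"data": {"C00221", "beta-D-Glucose", "R01520"},
                 "labels": {"Entry", "Name", "Reaction", "R00026"}},
    "reaction": {"data": {"R00026", "C00221"},
                 "labels": {"Entry", "Equation"}},
}

reinforced, mismatches = cross_check(CLASSIFIED)
print(reinforced, mismatches)
```

A value that is wrongly classified as a label on every class it appears on produces no conflict in this check, in line with the Label-Label case above.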



Slide 83

Where to go next?

Reverse engineering production
1. LOD: emitting RDF & RDFS
2. Navigation model: what belongs to what
3. Interaction model: (not treated at all by us so far)
4. Layout model: spatial positioning

Capture this generative model using machine learning

Relational learning
• Markov logic programmes?
• …?



Slide 84

Bibliography

Linda d’Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.

S. Mir, S. Staab, I. Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.

Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.

Thank you for your attention!
