Building and Using Knowledge Bases
WeST – Web Science & Technologies, University of Koblenz ▪ Landau, Germany
Building and Using Knowledge Bases
Steffen Staab
Saqib Mir – European Bioinformatics Institute
Ermelinda d'Oro, Massimo Ruffolo – Univ. Calabria, Italy
& WeST Team
Steffen Staab [email protected]
WeST – Web Science & Technologies
Slide 2
Semantic Web
Web Retrieval
Social Web
Multimedia Web
Software Web
Institut WeST – Web Science & Technologies
GESIS
Slide 3
PhD thesis trauma 17 years ago
„Nach dem Auspacken der LPS 105 präsentiert sich dem Betrachter ein stabiles Laufwerk, das genauso geringe Außenmaße besitzt wie die Maxtor.“
“Having unwrapped the LPS 105, the onlooker is presented with a sturdy disk drive whose outer dimensions are just as small as the Maxtor's.”
Slide 4
GENERAL MOTIVATION
The general motivation is not information extraction in itself; it is solving tasks!
Slide 5
General objective: Extracting to LOD
Figure labels: hasLivedIn, useAsExample
Crucial to know: Ontologies nowadays reflect this structure
Ontologies are
• Modular (vs one to rule them all)
• Distributed (vs defined in one place)
• Connected (vs isolated templates)
• Extensible (vs claimed to be finished)
• Lightweight (vs computationally intractable)
• Popular ones are used more often (vs people disagreeing)
Ontologies – LEGO style
Slide 6
Most famous applications
Steve Macbeth (Microsoft), in a discussion about Schema.org: “about 7% of pages we crawl have mark-up”
http://www.w3.org/2012/06/06-schema-minutes.html
LOD Cloud
Google Knowledge Graph
Bing gets its own knowledge graph
http://searchengineland.com/bing-britannica-partnership-123930
Slide 7
Example ontology-based application 1:
ANALYSIS OF URBAN PARAMETERS
Slide 8
General objective: Analysing LOD
Figure labels: hasLivedIn, useAsExample
Slide 9
Family's analysis of Koblenz LOD + OpenStreetMap data
http://lisa.west.uni-koblenz.de/lisa-demo/
Slide 10
Entrepreneur's analysis of Koblenz LOD + OpenStreetMap data
http://lisa.west.uni-koblenz.de/lisa-demo/
1st Prize, German Linked Open Gov Data Competition 2012
Slide 11
Example ontology-based application 2:
FACETED MULTIMEDIA EXPLORATION
Slide 12
Making Web 2.0 More Accessible
Links Location
Persons
Knowledge Tags
low- to mid-level features
GeoNames
[Schenk et al; JoWS 2009]
Slide 13
Choosing between Koblenz – and Koblenz
Video at: http://vimeo.com/2057249
Slide 16
A tag view of „Koblenz“ & „Castle“
Slide 17
Semantic Identity – Festung Ehrenbreitstein
Slide 18
Persons – Celebrities, FOAFers & Flickr Users
1st Prize, Billion Triples Challenge 2008
[Schenk et al; JoWS 2009]
Slide 19
Now on to information extraction:
OBSERVATIONS ON INFORMATION EXTRACTION
Slide 20
Challenges & Opportunities for IE
Not all web pages are created equal
Slide 21
Challenges & Opportunities for IE
Some challenges are the same, e.g. finding type instances
Slide 22
Challenges & Opportunities for IE
Some challenges are the same, e.g. finding relation instances
Slide 23
Challenges & Opportunities for IE
Some pages contain concepts and their descriptions, some don't
No types here, few relation types
Slide 24
Challenges & Opportunities for IE
Knowing that they are instances and of which type
Textual indication
Positional indication
Slide 25
Challenges & Opportunities for IE
To some extent positional and layout indications work across languages and sites
Slide 26
Challenges & Opportunities for IE
owl:sameAs
We should not only think about Web pages, but about Web sites
Slide 28
Comparing related work to our objectives
Related work objectives
• IE on Web pages
• Acquiring instances and relationship instances
• IE based on linear text
Our objectives
• IE on Web sites
• Acquiring items
• Classifying items in: Instances, Concepts, Relation instances, Relationships
• IE also based on spatial position
There is overlap, and of course there are exceptions in related work
Slide 29
Outline
The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation
[Oro et al; VLDB 2010]
Slide 30
Presentation-oriented documents
Slide 31
Presentation-oriented documents
• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)
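The first and last bullets can be illustrated with a minimal sketch. It assumes two hypothetical XHTML fragments that would render the same label/value pair but nest it differently; the markup, the path, and the helper `extract` are invented for illustration and are not taken from the talk. A DOM path written for one site silently returns nothing on the other.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages that would look alike when rendered (label, then value)
# but use different DOM nesting. Both fragments are invented for illustration.
SITE_A = ("<html><body><table><tr><th>Name</th><td>Cellobiose</td></tr>"
          "</table></body></html>")
SITE_B = ("<html><body><div><div><span>Name</span></div>"
          "<div><span>Cellobiose</span></div></div></body></html>")

# A path tailored to the table layout of site A.
PATH_FOR_A = "./body/table/tr/td"


def extract(page_source: str, path: str) -> list[str]:
    """Return the text of every element matched by an ElementTree path."""
    root = ET.fromstring(page_source)
    return [el.text for el in root.findall(path)]


print(extract(SITE_A, PATH_FOR_A))  # the value is found on site A
print(extract(SITE_B, PATH_FOR_A))  # nothing matches on site B
```

The path encodes how site A happens to nest its layout elements, not where the value appears on screen, which is the gap that spatial querying is meant to close.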
Slide 32
Related Work
Web query languages: XPath 1.0 and XQuery 1.0
• Established
• Too difficult to use for scraping from intricate DOM structures
Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]
Web wrapper induction
• exploiting visual interface [Gottlob et al.] [Sahuguet et al.]
• generate XPath location paths of DOM nodes
• can benefit from using Spatial XPath
Slide 33
Outline
The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation
Slide 34
Representing Spatial Relations between DOM Nodes
Slide 35
Idea: Use Spatial Relations among DOM Nodes
Slide 37
SXPath System Architecture
Slide 38
Querying for Relations Among Nodes
Rectangular Cardinal Relations (RCR)
Topological Relations
r1 E:NE r2
Spatial models allow for expressing disjunctive relations among regions
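A relation such as r1 E:NE r2 can be read as "r1 lies in the east and north-east tiles around r2". The following is a minimal sketch of that tile-based reading, not the SXPath implementation. It assumes axis-aligned bounding boxes with positive area and a y axis that grows upward (browser coordinates grow downward, which would swap north and south). The names `Rect`, `bands`, and `cardinal_tiles` are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x1: float  # left edge
    y1: float  # bottom edge (y grows upward in this sketch)
    x2: float  # right edge
    y2: float  # top edge


# Tiles induced by extending the edges of the reference rectangle.
# Indexed as TILES[row][column]; row 0 is south, column 0 is west, B is the box itself.
TILES = (
    ("SW", "S", "SE"),
    ("W", "B", "E"),
    ("NW", "N", "NE"),
)


def bands(lo, hi, ref_lo, ref_hi):
    """Which bands (0 = before, 1 = within, 2 = after the reference interval)
    the interval [lo, hi] overlaps with positive length."""
    found = []
    if lo < ref_lo:
        found.append(0)
    if hi > ref_lo and lo < ref_hi:
        found.append(1)
    if hi > ref_hi:
        found.append(2)
    return found


def cardinal_tiles(region, reference):
    """Set of tiles of `reference` that `region` overlaps."""
    columns = bands(region.x1, region.x2, reference.x1, reference.x2)
    rows = bands(region.y1, region.y2, reference.y1, reference.y2)
    return frozenset(TILES[row][col] for row in rows for col in columns)


r2 = Rect(0, 0, 10, 10)
r1 = Rect(12, 5, 20, 15)  # right of r2, reaching above its top edge
print(sorted(cardinal_tiles(r1, r2)))
```

Because both operands are rectangles, the overlapped tiles always form a rectangular block, so a single relation like E:NE covers what would otherwise need a disjunction of single-tile relations.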
Slide 42
From XPath 1.0 towards Spatial Querying with SXPath
SXPath features
• adopts the intuitive path notation: axis::nodetest [pred]*
• adds to XPath: spatial axes, spatial position functions
• natural semantics for spatial querying
Slide 43
SXPath System Architecture
Slide 44
Complexity Results
Formal model defined in the paper [Oro et al; VLDB 2010]
Slide 45
Outline
The Bio-Case
The Social Media-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation
Slide 50
Outline
The Bio-Case
• Motivation
• The (Biochemical) Deep Web
• Contributions
  • Page-level wrapper induction
  • Site-wide wrapper generation
  • Error Correction by Mutual Reinforcement
• Conclusions and Future Directions
The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  • Spatial Data Model
  • Syntax & Semantics
  • Complexity
• Implementation
• Evaluation
Slide 51
>1000 Life Science DBs, number growing quickly
Slide 52
Biochemical Web Sites: Observations - 1
Labeled Data
Total | Labeled | Unlabeled | Unlabeled (Redundant)
754 | 719 | 19 | 16
Table 1: Data fields across 20 Biochemical Web sites
Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)
Slide 53
Biochemical Web Sites: Observations - 2
Dynamic Web Pages
Slide 54
Biochemical Web Sites: Observations - 3
Rich Site Structure
Slide 55
Biochemical Web Sites: Observations - 4
Semantics is often only in the report, not in the underlying relational database
Web Services Survey: 11 of 100 databases (1) provide APIs
• Incomplete coverage
• Varying granularity
• No semantics in the service description
(1) Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). The complete survey was available at http://sabiork.villa-bosch.de/index.html/survey.html
Slide 56
Biochemical Web Sites: Extraction Tasks
Induce Wrapper (label repeated three times in the figure)
[Mir et al; DILS 2009] [Mir et al; ESWC 2010]
Slide 57
Contributions
Unsupervised Page-Level Wrapper Induction
Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery)
(Acquiring the Schema/Ontology)
Automatic Error Detection and Correction by Mutual Reinforcement
Slide 58
Page-Level Wrapper Induction – 1
D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, …}
O1 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}
D2 = {C00185, Cellobiose, …, R00306, 1.1.99.18, …}
O2 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}
//*[text()]
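The sets on this slide can be read as follows: all text entries of two sample pages are collected (the role of `//*[text()]`), entries that differ between the pages go into D (candidate data), and entries that occur on both go into O (candidate labels). That reading is an assumption, as are the tiny page fragments and the helper names in the sketch below; only the example values come from the slide.

```python
import xml.etree.ElementTree as ET


def text_entries(page_source):
    """Stripped leading text of every element that has some.
    This approximates //*[text()]; ElementTree tail text is ignored."""
    root = ET.fromstring(page_source)
    return [el.text.strip() for el in root.iter() if el.text and el.text.strip()]


def split_entries(page, other_page):
    """Return (differing, overlapping) text entries of `page` relative to `other_page`."""
    mine = text_entries(page)
    theirs = set(text_entries(other_page))
    differing = [t for t in mine if t not in theirs]
    overlapping = [t for t in mine if t in theirs]
    return differing, overlapping


# Hypothetical, heavily simplified stand-ins for two pages of one class.
PAGE_1 = ("<html><table>"
          "<tr><th>Entry</th><td>C00221</td></tr>"
          "<tr><th>Name</th><td>beta-D-Glucose</td></tr>"
          "<tr><th>Enzyme</th><td><a>1.1.1.47</a><a>3.2.1.21</a></td></tr>"
          "</table></html>")
PAGE_2 = ("<html><table>"
          "<tr><th>Entry</th><td>C00185</td></tr>"
          "<tr><th>Name</th><td>Cellobiose</td></tr>"
          "<tr><th>Enzyme</th><td><a>1.1.99.18</a><a>3.2.1.21</a></td></tr>"
          "</table></html>")

d1, o1 = split_entries(PAGE_1, PAGE_2)
print(d1)
print(o1)
```

In this toy input the enzyme number 3.2.1.21 appears on both pages and therefore lands among the overlapping entries even though it is data, which is the kind of case a later reclassification step has to repair.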
Slide 59
Page-Level Wrapper Induction - 2
Reclassify – Growing Data Regions
Slide 60
Page-Level Wrapper Induction - 3
D1´ = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21, …}
O1´ = {Entry, Name, …, Reaction, R00026, Enzyme, …}
D2´ = {C00185, Cellobiose, …, R00306, 1.1.99.18, 3.2.1.21, …}
O2´ = {Entry, Name, …, Reaction, R00026, Enzyme, …}
Slide 61
Page-Level Wrapper Induction - 4
Selecting Labels for Data
html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] (“1.1.1.47”)
html/…./table[1]/tr[6]/th[1]/…/code[1]/ (“Reaction”)
html/…./table[1]/tr[8]/th[1]/…/code[1]/ (“Enzyme”)
Slide 62
Page-Level Wrapper Induction - 5
Anchor the Path
Enzyme - html/table[1]/tr[8]/th[1]/code[1]/
html/table[1]/tr[8]/td[1]/code[1]/a[1]
html/table[1]/tr[8]/td[1]/code[1]/a[2]
//*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()≥2]/text()
Pivot, Generalize, Relative
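One way to read the three keywords is sketched below, as an assumption rather than the authors' algorithm: find the deepest node shared by the label path and the data paths (pivot), climb from the label node to that pivot and descend to the data (relative), and replace the varying index of the last step by a positional predicate (generalize). The function names are hypothetical, and the bound of the predicate is simply the smallest index observed, which differs from the bound written on the slide.

```python
import re

# A path step with an explicit index, e.g. "a[2]".
STEP = re.compile(r"^(?P<name>[^\[\]]+)\[(?P<index>\d+)\]$")


def split_steps(path):
    """Split 'html/table[1]/…/' into its non-empty steps."""
    return [s for s in path.strip("/").split("/") if s]


def relative_wrapper(label_text, label_path, data_paths):
    label = split_steps(label_path)
    data = [split_steps(p) for p in data_paths]

    # Pivot: number of leading steps shared by the label path and every data path.
    pivot = 0
    while (pivot < len(label)
           and all(pivot < len(p) and p[pivot] == label[pivot] for p in data)):
        pivot += 1

    # Relative: climb from the label node up to the pivot node.
    up = [".."] * (len(label) - pivot)

    # Generalize: data suffixes may differ only in the index of their last step.
    suffixes = [p[pivot:] for p in data]
    if any(not s for s in suffixes):
        raise ValueError("a data path ends at the pivot")
    inner = suffixes[0][:-1]
    last = [STEP.match(s[-1]) for s in suffixes]
    if any(m is None for m in last) or any(s[:-1] != inner for s in suffixes):
        raise ValueError("data paths do not share one generalizable shape")
    names = {m.group("name") for m in last}
    if len(names) != 1:
        raise ValueError("data paths end in different element names")
    lowest = min(int(m.group("index")) for m in last)
    down = inner + [f"{names.pop()}[position()>={lowest}]"]

    return f"//*[text()='{label_text}']/" + "/".join(up + down) + "/text()"


wrapper = relative_wrapper(
    "Enzyme",
    "html/table[1]/tr[8]/th[1]/code[1]/",
    ["html/table[1]/tr[8]/td[1]/code[1]/a[1]",
     "html/table[1]/tr[8]/td[1]/code[1]/a[2]"],
)
print(wrapper)
```

Anchoring on the label text instead of an absolute path keeps the wrapper usable on pages where rows above the label are missing or added.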
Slide 63
Selected Sources
KEGG, ChEBI, MSDChem
• Basic qualitative data
• Popular
• Overlapping/complementary data
Slide 64
Wrapper Induction - Evaluation
SOURCE | #L | #D | #S | TP | FN | FP | P | R
KEGG Compound (http://www.genome.jp/kegg/compound/) | 10 | 762 | 3 | 411 | 351 | 46 | 89.9 | 53.9
 | | | 15 | 759 | 3 | 0 | 100 | 99.6
KEGG Reaction (http://www.genome.jp/kegg/reaction/) | 10 | 205 | 3 | 173 | 32 | 0 | 100 | 84.4
 | | | 15 | 205 | 0 | 0 | 100 | 100
ChEBI (http://www.ebi.ac.uk/chebi) | 22 | 831 | 3 | 595 | 236 | 41 | 93.5 | 71.6
 | | | 15 | 829 | 2 | 0 | 100 | 99.7
MSDChem (http://www.ebi.ac.uk/msd-srv/msdchem/) | 30 | 600 | 3 | 600 | 0 | 20 | 96.7 | 100
 | | | 15 | 600 | 0 | 20 | 96.7 | 100
Average (based on final wrappers for each source) | | | | | | | 99.1 | 99.8
~9 samples – ~99% P, ~98% R
Table 2: Page-level wrapper induction results, 20 test pages (L = Labels, D = Data entries, S = Training pages)
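As a worked check of the P and R columns, the sketch below recomputes precision and recall for the two KEGG Compound rows, assuming the usual definitions P = TP / (TP + FP) and R = TP / (TP + FN) and reading the row values as TP, FN, FP in that order. The function name is hypothetical.

```python
def precision_recall(tp, fn, fp):
    """Precision and recall in percent, rounded to one decimal place."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return round(precision, 1), round(recall, 1)


# KEGG Compound with 3 training pages: TP=411, FN=351, FP=46
print(precision_recall(411, 351, 46))
# KEGG Compound with 15 training pages: TP=759, FN=3, FP=0
print(precision_recall(759, 3, 0))
```

With 3 training pages almost half of the 762 data entries are missed; with 15 pages only 3 remain undetected, which is where the jump in recall comes from.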
Slide 65
Site-Wide Wrapper Induction: Observations
Not all pages contain data (e.g. Legal disclaimers, contact pages, navigational menus)
An efficient approach should ignore these pages
We don't need to learn the entire site structure
Slide 66
Site-Wide Wrapper Induction: Observations - 2
Classified Link-Collections point to data-intensive pages of the same class.
Slide 67
Site-Wide Wrapper Induction: Observations - 3
Pages belonging to the same class describe the same concepts
• Some concepts are sometimes omitted
• Ordering is always the same
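Taken together, these observations suggest a simple membership test: the label sequence of a page should equal the label sequence of its class with some entries left out and the order preserved. The sketch below is one possible way to check this, not a step described in the talk; the function name and the label "Formula" are hypothetical.

```python
def follows_class_order(page_labels, class_labels):
    """True if page_labels is class_labels with some entries omitted, order kept."""
    remaining = iter(class_labels)
    # `label in remaining` advances the iterator, so matches must occur in order.
    return all(label in remaining for label in page_labels)


CLASS_LABELS = ["Entry", "Name", "Formula", "Reaction", "Enzyme"]

print(follows_class_order(["Entry", "Name", "Enzyme"], CLASS_LABELS))
print(follows_class_order(["Name", "Entry"], CLASS_LABELS))
```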
Slide 68
Site-Wide Wrapper Induction
1. Start with C0
2. Follow all classified link-collections
3. Generate wrappers for each set of target pages
4. Determine if new class is formed
5. Add navigation step
6. Repeat 2 – 5 for each new class formed in 4
Figure labels: start class C0; link collections L1, L2, L3; target classes C1, C2, C3
S = {C0}
If C0 != Ci (i>0): S = S + Ci
Navigation steps: W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
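The six steps can be sketched as a worklist loop. This is a minimal illustration under strong assumptions: wrapper generation and class comparison (steps 3 and 4) are hidden behind a caller-supplied function, and the toy site below simply hard-codes which class each link collection leads to. All function and variable names are hypothetical.

```python
from collections import deque


def discover_site_structure(start, link_collections, class_of_targets):
    """Return (classes S, navigation steps W) reachable from the start class.

    link_collections(cls)       -> link collections found on pages of class cls
    class_of_targets(cls, link) -> class assigned to the pages the link points to
                                   (stands in for wrapper generation + comparison)
    """
    known = [start]            # S = {C0}
    steps = []                 # W
    queue = deque([start])
    while queue:
        cls = queue.popleft()
        for link in link_collections(cls):          # step 2
            target = class_of_targets(cls, link)    # steps 3 and 4
            steps.append((cls, link, target))       # step 5
            if target not in known:                 # new class: S = S + Ci
                known.append(target)
                queue.append(target)                # step 6
    return known, steps


# Toy site mirroring the slide: L1 leads back to C0, L2 to C2, L3 to C3.
SITE = {"C0": {"L1": "C0", "L2": "C2", "L3": "C3"}, "C2": {}, "C3": {}}

classes, navigation = discover_site_structure(
    "C0",
    link_collections=lambda cls: SITE[cls],
    class_of_targets=lambda cls, link: SITE[cls][link],
)
print(classes)
print(navigation)
```

Only pages reachable through classified link collections are visited, so disclaimers, contact pages, and similar pages without data never enter the worklist.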
Slide 69
Site-Wide Wrapper Induction – Evaluation
SOURCE | #C | #C´ | #D | TP | FN | FP | P | R
MSDChem | 1 | 1 | N/A | N/A | N/A | N/A | N/A | N/A
ChEBI | 3 | 1 | 1711 | 1195 | 516 | 0 | 100 | 69.8
KEGG | 10 | 7 | 6223 | 5044 | 1179 | 188 | 97 | 81.1
Average | | | | | | | 98.5 | 75.5
Table 3: Site-wide wrapper induction results, 20 test pages for each class (C = Classes, C´ = Classes discovered, D = Data entries)
Slide 70
Error Detection and Correction: Mutual Reinforcement
Observation: Certain data reappear on more than one class of pages
Slide 71
Error Detection and Correction: Mutual Reinforcement
Reinforcement if reappearing data is correctly classified as Data
Otherwise it points to a misclassification
Label-Data Mismatch
• Correction: Introduce more samples
Label-Label Mismatch
• Cannot be detected
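The detection part can be sketched as a cross-check between page classes: a value that reappears on several classes but is classified as data on one and as a label on another signals a Label-Data mismatch. The input format, the function name, and the example classification are assumptions made for illustration; the identifiers are reused from the earlier slides.

```python
from collections import defaultdict


def find_label_data_mismatches(classified):
    """classified: {page_class: {text_value: "data" or "label"}}.
    Returns the values whose classification differs between page classes."""
    seen = defaultdict(dict)
    for page_class, entries in classified.items():
        for value, kind in entries.items():
            seen[value][page_class] = kind
    return {value: by_class for value, by_class in seen.items()
            if len(set(by_class.values())) > 1}


# Hypothetical outcome of page-level induction on two page classes.
classified = {
    "Compound": {"C00221": "data", "R00026": "label", "Entry": "label"},
    "Reaction": {"R00026": "data", "Entry": "label"},
}

print(find_label_data_mismatches(classified))
```

A value that is classified as a label on every class where it occurs, such as "Entry" here, produces no conflict, which is why a Label-Label mismatch cannot be detected this way.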
Slide 72
Where to go next?
Reverse engineering production
1. LOD
2. Navigation model
3. Interaction model
4. Layout model
Capture this generative model using machine learning
Relational learning
• Markov logic programmes?
• …?
emitting RDF & RDFS
what belongs to what
(- not treated at all by us so far -)
spatial positioning
Slide 73
Bibliography
Ermelinda Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.
S. Mir, S. Staab, I. Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.
Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.
WeST – Web Science & Technologies, University of Koblenz ▪ Landau, Germany
Thank you for your attention!