Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion
Understanding the Hidden Web
Pierre Senellart
Max-Planck-Institut für Informatik, 22 August 2007
P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 1 / 32
The Hidden Web
Definition (Hidden Web, Deep Web, Invisible Web)
The part of Web content not accessible from the hyperlinked structure of the World Wide Web. Typically: HTML forms, Web Services.
Size estimate (2001): 500 times larger than the surface Web.
How to understand it and benefit from its content?
Understanding the Hidden Web
Purpose
Intensional indexing of the Hidden Web.
High-level queries.
⇒ a semantic search engine over the Hidden Web.
In a fully automatic, unsupervised way!
Difficult and broad problem.
Use of domain knowledge (ontology, instances).
Example of the database publication domain.
Web Service Semantic Interpretation Process
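[Figure: processing pipeline. WWW → (discovery) → HTML form → (probing) → analyzed form + result pages → (wrapper induction) → Web service. WWW → (discovery) → Web service → (semantic analysis) → analyzed Web service → (indexing) → service index. The user sends a query to the service index and receives results.]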
Imprecise Data and Imprecise Tasks
Observations
Many needed tasks generate imprecise data, with some confidence value.
Need for a way to manage this imprecision, to work with it throughout an entire complex process.
A Probabilistic XML Warehouse
[Figure: Module 1, Module 2 and Module 3 access a Probabilistic XML Warehouse through an update interface and a query interface. Modules send update transactions + confidence and queries; they receive results + confidence.]
A Probabilistic XML Warehouse (Hidden Web)
[Figure: the same architecture, with the three modules instantiated as Topic crawler, Form analyzer and Inf. Extractor. The successive overlays show these exchanges with the warehouse:
1 Update transaction: crawled URLs + confidence.
2 Query: "Form URL?"
3 Results: URLs + confidence.
4 Update transaction: analyzed form + confidence.
5 Query: "Form?"
6 Results: "Person → [symbol lost in extraction]" + confidence.
7 Update transaction: "Person → ISBN" + confidence.]
Outline
1 Introduction
2 A Probabilistic XML Data Model
3 Probing the Hidden Web
4 Wrapper Induction from Result Pages
5 Deriving Schema Mappings from Database Instances
6 Semantic Model
7 Conclusion
Joint work with Serge Abiteboul.
Probabilistic Trees
Framework: unordered data trees.
Details: no attributes, no mixed content, ...
[Figure: the tree A(B, C(D)) $\neq$ the tree A(B, B, C(D)), where X(...) lists the children of node X (multiset semantics).]
Sample space: set of all such data trees.
Probabilistic tree (prob-tree): representation of a discrete probability distribution over this sample space.
The Prob-Tree Model
Data tree with event conditions (conjunctions of probabilistic events or negations of probabilistic events) assigned to each node.
Probabilistic events are Boolean random variables, assumed to be independent, with their own probability distribution.
[Figure: root A with children B and C; C has child D. B carries the condition $w_1, \neg w_2$; D carries the condition $w_2$.]
Event   Prob.
$w_1$   0.8
$w_2$   0.7
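The possible-world semantics of this small example can be sketched in Python. This is only an illustration, not the implementation mentioned in the talk: it assumes that D is a child of C, that node labels are unique (so a node can be identified by its label), and it enumerates all event valuations, which is exponential in the number of events. The names EVENTS, NODES and possible_worlds are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Event probabilities, as given in the slide's table.
EVENTS = {"w1": 0.8, "w2": 0.7}

# Each node is (label, parent label or None, condition). A condition maps an
# event name to the truth value it must take (a conjunction of literals).
# Assumption: D is a child of C, and parents are listed before their children.
NODES = [
    ("A", None, {}),
    ("B", "A", {"w1": True, "w2": False}),
    ("C", "A", {}),
    ("D", "C", {"w2": True}),
]

def possible_worlds(nodes, events):
    """Map each possible tree (as a set of kept labels) to its probability."""
    names = sorted(events)
    worlds = defaultdict(float)
    for values in product([True, False], repeat=len(names)):
        valuation = dict(zip(names, values))
        # Events are independent, so the valuation's probability is a product.
        prob = 1.0
        for name in names:
            prob *= events[name] if valuation[name] else 1.0 - events[name]
        kept = set()
        for label, parent, condition in nodes:
            holds = all(valuation[e] == v for e, v in condition.items())
            # A node survives only if its condition holds and its parent survives.
            if holds and (parent is None or parent in kept):
                kept.add(label)
        worlds[frozenset(kept)] += prob
    return dict(worlds)

if __name__ == "__main__":
    for world, prob in sorted(possible_worlds(NODES, EVENTS).items(), key=lambda kv: kv[1]):
        print(sorted(world), round(prob, 2))
```

Under these assumptions, three distinct trees receive a non-zero probability, and the probabilities sum to 1.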
Features of the Prob-Tree Model
Well-defined possible world semantics.
Full expressive power, reasonable conciseness.
Possible to apply queries and updates directly on prob-trees, in an efficient way.
Complexity study.
Implementation available.
Joint work with Avin Mittal (IIT Bombay).
Analyzing HTML Forms
Analyzing the structure of HTML forms.
Problem
Associate each relevant form field with its corresponding domain concept.
First Step: Structural Analysis
1 Build a context for each field:
  label tag;
  id and name attributes;
  text immediately before the field.
2 Remove stop words, stem.
3 Match this context with the concept names, extended with WordNet.
4 Obtain in this way candidate annotations.
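The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the talk: the stop-word list, the crude suffix-stripping stemmer and the CONCEPT_SYNONYMS table (a hand-written stand-in for the WordNet extension) are all hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would use a standard one.
STOP_WORDS = {"the", "of", "a", "an", "by", "for", "in", "enter", "your", "please"}

# Hypothetical synonym table standing in for concept names extended with WordNet.
CONCEPT_SYNONYMS = {
    "Person": {"person", "author", "name"},
    "Title": {"title"},
    "Year": {"year", "date"},
}

def stem(word):
    """Very crude suffix stripping; a stand-in for a proper stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def context_terms(label, field_id, name, preceding_text):
    """Steps 1 and 2: build the field context, drop stop words, stem."""
    raw = " ".join([label, field_id, name, preceding_text])
    raw = re.sub(r"([a-z])([A-Z])", r"\1 \2", raw)  # split camelCase identifiers
    words = re.findall(r"[a-z]+", raw.lower())
    return {stem(w) for w in words if w not in STOP_WORDS}

def candidate_annotations(label, field_id, name, preceding_text):
    """Steps 3 and 4: match the context against the extended concept names."""
    terms = context_terms(label, field_id, name, preceding_text)
    return sorted(
        concept
        for concept, synonyms in CONCEPT_SYNONYMS.items()
        if terms & {stem(s) for s in synonyms}
    )

if __name__ == "__main__":
    print(candidate_annotations("Authors:", "auth", "authorName", "Enter the authors"))
```

The output is only a set of candidates; the probing step is what confirms or rejects them.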
Second Step: Confirm Annotations with Probing
For each field annotated with a concept c:
1 Probe the field with a nonsense word to get an error page.
2 Probe the field with instances of c (chosen representatively of the frequency distribution of c).
3 Compare pages obtained by probing with the error page (by using clustering along the DOM tree structure of the pages), to distinguish error pages and result pages.
4 Confirm the annotation if enough result pages are obtained.
In practice, very good precision and good recall; but some limitations on the kind of forms that can be dealt with.
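The probing loop can be sketched as below. This is a simplified illustration, not the method of the talk: instead of clustering along the DOM tree, it compares the sets of root-to-element tag paths of two pages with a Jaccard similarity, and treats a page as a result page when it differs enough from the error page. The submit callback, the nonsense word and both thresholds are hypothetical.

```python
from html.parser import HTMLParser

class _PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML page."""

    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def dom_paths(html):
    collector = _PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths

def similarity(page_a, page_b):
    """Jaccard similarity of the two pages' tag-path sets."""
    paths_a, paths_b = dom_paths(page_a), dom_paths(page_b)
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0

def confirm_annotation(submit, instances, nonsense="zzqxjkv",
                       threshold=0.8, min_ratio=0.5):
    """Confirm a candidate annotation if enough probes yield result pages.

    submit(value) is a hypothetical callback returning the HTML obtained
    when the field is filled with value and the form is submitted.
    """
    if not instances:
        return False
    error_page = submit(nonsense)
    result_pages = sum(
        1 for value in instances if similarity(submit(value), error_page) < threshold
    )
    return result_pages / len(instances) >= min_ratio
```

A structural comparison is used rather than a textual one because result pages for different instances share a layout while differing in content.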
Joint work with researchers from mostrare (INRIA Futurs).
Query-answer Web Pages
Extract data from query-answer Web pages.
Issues
What part of the Web page contains the answer?
How to extract structured content?
Automatic Wrapper Induction with Domain Knowledge
Annotate pages with domain knowledge (finite-automata techniques): both imperfect and incomplete.
Use machine learning to generalize the result into a structural extraction wrapper (Conditional Random Fields).
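The first of these two steps, annotation with domain knowledge, can be illustrated with a small sketch. It is an assumption-laden stand-in: a set of known instances and a regular expression play the role of the finite-automata techniques, and the learning step with Conditional Random Fields is not shown. KNOWN_PERSONS, YEAR and annotate are hypothetical names, and the sample text nodes are invented.

```python
import re

# Hypothetical domain instances (a tiny gazetteer of known persons).
KNOWN_PERSONS = {"Serge Abiteboul", "Georg Gottlob"}

# A four-digit year, recognized by a regular expression.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def annotate(text_nodes):
    """Return (node index, concept, matched text) for every match found."""
    annotations = []
    for index, text in enumerate(text_nodes):
        for person in KNOWN_PERSONS:
            if person in text:
                annotations.append((index, "Person", person))
        for match in YEAR.finditer(text):
            annotations.append((index, "Year", match.group(0)))
    return sorted(annotations)

if __name__ == "__main__":
    nodes = [
        "Serge Abiteboul, Pierre Senellart",  # second author is not in the gazetteer
        "Some paper title",
        "Some conference, 2006",
        "Call 2000 now",                      # not a publication year
    ]
    for annotation in annotate(nodes):
        print(annotation)
```

The sample shows why the annotation is both incomplete (an author is missed) and imperfect (a number is wrongly tagged as a year), which is what the generalization step has to cope with.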
Joint work with Georg Gottlob (Oxford University).
Motivation
Analyzing the relations between different sources, or between a source and the domain knowledge.
Problem
Given two database instances $I$ and $J$ with different schemata, what is the optimal description $\Sigma$ of $J$ with respect to $I$ (with $\Sigma$ a finite set of formulæ in some logical language)?
What does optimal imply?
Conciseness of description.
Validity of facts predicted by $I$ and $\Sigma$.
Facts of $J$ explained by $I$ and $\Sigma$.
(Note the asymmetry between $I$ and $J$: context of data exchange where $J$ is computed from $I$ and $\Sigma$.)
Example (Tuple-Generating Dependencies)
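$R$: a, b, c, d
$R'$: (a, a), (b, b), (c, a), (d, d), (g, h)
$\Sigma_0 = \varnothing$
$\Sigma_1 = \{\forall x\; R(x) \rightarrow R'(x, x)\}$
$\Sigma_2 = \{\forall x\; R(x) \rightarrow \exists y\; R'(x, y)\}$
$\Sigma_3 = \{\forall x \forall y\; R(x) \wedge R(y) \rightarrow R'(x, y)\}$
$\Sigma_4 = \{\exists x \exists y\; R'(x, y)\}$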
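To make the trade-off between validity and explanation concrete, the candidate descriptions can be checked against the example instance in a few lines of Python. This is a sketch under assumptions: the instance is read from the slide's flattened table as R = {a, b, c, d} and R' = {(a, a), (b, b), (c, a), (d, d), (g, h)}; the rules are the identity rule R(x) → R'(x, x), the existential rule R(x) → ∃y R'(x, y), the product rule R(x) ∧ R(y) → R'(x, y) and the bare existence rule ∃x∃y R'(x, y). The function names and the report format are hypothetical, and the repair-based cost measure of the talk is not computed.

```python
# Source instance I (unary relation R) and target instance J (binary relation R').
R = {"a", "b", "c", "d"}
R_PRIME = {("a", "a"), ("b", "b"), ("c", "a"), ("d", "d"), ("g", "h")}

def predicted_by_identity(r):
    """Facts predicted by the rule R(x) -> R'(x, x)."""
    return {(x, x) for x in r}

def predicted_by_product(r):
    """Facts predicted by the rule R(x) and R(y) -> R'(x, y)."""
    return {(x, y) for x in r for y in r}

def report(predicted, target):
    """Compare predicted facts with the target instance."""
    return {
        "valid": predicted <= target,               # every predicted fact holds in J
        "wrong_facts": len(predicted - target),     # predicted facts absent from J
        "unexplained": sorted(target - predicted),  # facts of J that are not predicted
    }

def existential_rule_valid(r, target):
    """Validity of R(x) -> exists y. R'(x, y)."""
    return all(any(first == x for first, _ in target) for x in r)

def existence_rule_valid(target):
    """Validity of exists x, y. R'(x, y)."""
    return len(target) > 0

if __name__ == "__main__":
    print("identity rule:", report(predicted_by_identity(R), R_PRIME))
    print("product rule:", report(predicted_by_product(R), R_PRIME))
    print("existential rule valid:", existential_rule_valid(R, R_PRIME))
    print("existence rule valid:", existence_rule_valid(R_PRIME))
```

On this instance the identity rule predicts one wrong fact and leaves two facts unexplained, while the product rule leaves a single fact unexplained at the price of many wrong predictions; the two existential descriptions are valid but say little about the actual tuples.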
Results
Description based on the minimum length of a repair of a formula that is valid and explains all facts of $J$.
This optimality notion gives “intuitive” results for instances derived from each other with simple operations.
Detailed complexity analysis for various languages and decision problems. Quite high in the polynomial hierarchy (up to the fourth level for general tgds!).
Even for $\forall x_1 \forall x_2 \forall x_3\; R(x_1, x_2, x_3) \rightarrow R'(x_1)$, computing the size of the minimal perfect repair is already NP-complete.
Joint work with Serge Abiteboul.
Conceptual Model
IsA ontology of concepts (simple DAG)
Thing
  Person
    Man
    Woman
  Publication
    Proceedings
    Article
    Book
n-ary typed roles
AuthorOf(Publication, Person)
HasName(Person, Name)
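As an illustration only, the conceptual model above could be encoded as follows. This is a minimal sketch, not the authors' implementation: the names `IS_A`, `ROLES`, `is_a` and `well_typed` are hypothetical, and placing `Name` directly under `Thing` is an assumption, since the slide does not show where `Name` sits in the ontology.

```python
# Hypothetical encoding of the IsA ontology: each concept maps to its direct parents.
# A list of parents (rather than a single parent) allows a DAG, not just a tree.
IS_A = {
    "Thing": [],
    "Person": ["Thing"],
    "Man": ["Person"],
    "Woman": ["Person"],
    "Publication": ["Thing"],
    "Proceedings": ["Publication"],
    "Article": ["Publication"],
    "Book": ["Publication"],
    "Name": ["Thing"],  # assumption: not shown on the slide
}

# Typed roles: role name -> tuple of expected concepts, one per argument.
ROLES = {
    "AuthorOf": ("Publication", "Person"),
    "HasName": ("Person", "Name"),
}


def is_a(concept, ancestor):
    """Return True if `concept` equals `ancestor` or reaches it through IsA edges."""
    stack, seen = [concept], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A[current])
    return False


def well_typed(role, arg_concepts):
    """Check that each argument concept is subsumed by the role's declared type."""
    signature = ROLES[role]
    return len(signature) == len(arg_concepts) and all(
        is_a(arg, expected) for arg, expected in zip(arg_concepts, signature)
    )
```

With this encoding, `well_typed("AuthorOf", ("Article", "Woman"))` holds because `Article` is a `Publication` and `Woman` is a `Person`, while swapping the two arguments fails the check.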
Semantic Representation of a Service
What is a service described by?
An n-tuple of typed input parameters.
A complex (= nested) type for its output.
Semantic relations between inputs and outputs (Datalog-like description).
Definition (Complex types). S: set of concepts
$T ::= S \mid \langle T, \ldots, T \rangle \mid T^{*}$
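The grammar of complex types can be mirrored by a small recursive data structure. The sketch below is an assumption-laden illustration: the class names `Concept`, `TupleType`, `ListType` and the helpers `render` and `depth` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Concept:
    """Base case: a concept taken from the set S."""
    name: str


@dataclass(frozen=True)
class TupleType:
    """Tuple constructor <T, ..., T>."""
    components: Tuple["ComplexType", ...]


@dataclass(frozen=True)
class ListType:
    """Star constructor T*: a collection of values of type T."""
    element: "ComplexType"


ComplexType = Union[Concept, TupleType, ListType]


def render(t):
    """Print a complex type in the notation used on the slides."""
    if isinstance(t, Concept):
        return t.name
    if isinstance(t, TupleType):
        return "<" + ",".join(render(c) for c in t.components) + ">"
    if isinstance(t, ListType):
        return render(t.element) + "*"
    raise TypeError(f"not a complex type: {t!r}")


def depth(t):
    """Nesting depth: 0 for a concept, plus 1 for each tuple or star constructor."""
    if isinstance(t, Concept):
        return 0
    if isinstance(t, TupleType):
        return 1 + max((depth(c) for c in t.components), default=0)
    return 1 + depth(t.element)
```

For example, `ListType(TupleType((Concept("Person"), ListType(Concept("Name")))))` renders as `<Person,Name*>*`, a list of pairs whose second component is itself a list.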
Services and Queries
Example: service giving authors from publication titles
A* ← AuthorOf(A,P), HasTitle(P,T), Input(T)
Example: query
<A,T*>* ← AuthorOf(A,P), Article(P), HasTitle(P,T), KeywordOf("xml",P)
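To make the Datalog-like reading of a service concrete, here is a minimal sketch that evaluates the body of the first example over a toy fact base. Everything in it is assumed for illustration: the facts, the `?X` convention for variables, and the helpers `match`, `solve` and `call_service`. `Input(T)` is modeled by binding `?T` before evaluation, and argument order follows the rule as written (author first).

```python
# Toy facts, invented for this example: (predicate, arg1, arg2).
FACTS = {
    ("AuthorOf", "alice", "p1"),
    ("AuthorOf", "bob", "p1"),
    ("AuthorOf", "carol", "p2"),
    ("HasTitle", "p1", "Probabilistic XML"),
    ("HasTitle", "p2", "Wrapper Induction"),
}


def match(atom, fact, binding):
    """Try to extend `binding` so that `atom` equals `fact`; return None on failure."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    extended = dict(binding)
    for term, value in zip(atom[1:], fact[1:]):
        if term.startswith("?"):  # variable: bind it, or check the existing binding
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:  # constant: must match exactly
            return None
    return extended


def solve(body, facts, binding):
    """Yield every binding that satisfies all atoms of `body` (naive nested-loop join)."""
    if not body:
        yield binding
        return
    first, rest = body[0], body[1:]
    for fact in facts:
        extended = match(first, fact, binding)
        if extended is not None:
            yield from solve(rest, facts, extended)


def call_service(title):
    """A* <- AuthorOf(A,P), HasTitle(P,T), Input(T), with T supplied by the caller."""
    body = [("AuthorOf", "?A", "?P"), ("HasTitle", "?P", "?T")]
    return sorted({b["?A"] for b in solve(body, FACTS, {"?T": title})})
```

Calling `call_service("Probabilistic XML")` returns the list of authors of publication `p1`; a title absent from the fact base yields an empty list.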
Managing Extensional Information
How to represent extensional information (i.e. documents) in this formalism?
Definition: A document is a service with no input.
Complex types: natural representation of a DTD.
(Disjunctions a|b simulated by (a?,b?)).
Web Service Indexing and Querying
Given a query, represented as an analyzed Web service, how to know which known Web services to query?
Issues:
Subsumption of input/output parameters.
Missing input parameters.
Composition of Web Services.
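One plausible reading of the first two issues can be sketched as a lookup over a service index. This is only an illustration under stated assumptions: the index `SERVICES`, the concepts `Title` and `Year`, and the functions `subsumed_by` and `candidates` are all hypothetical, and composition of services is not covered.

```python
# Hypothetical IsA edges (concept -> direct parents); Title and Year are assumptions.
IS_A = {
    "Thing": [],
    "Person": ["Thing"],
    "Woman": ["Person"],
    "Publication": ["Thing"],
    "Book": ["Publication"],
    "Title": ["Thing"],
    "Year": ["Thing"],
}

# Hypothetical service index: typed inputs and a single output concept per service.
SERVICES = {
    "authors_by_title": {"inputs": ["Title"], "output": "Person"},
    "books_by_author": {"inputs": ["Person", "Year"], "output": "Book"},
}


def subsumed_by(concept, ancestor):
    """Return True if `concept` is `ancestor` or one of its IsA descendants."""
    if concept == ancestor:
        return True
    return any(subsumed_by(parent, ancestor) for parent in IS_A[concept])


def candidates(provided, wanted):
    """List (service, missing_inputs) pairs whose output is subsumed by `wanted`.

    An input counts as satisfied when some provided concept is subsumed by it.
    """
    result = []
    for name, service in sorted(SERVICES.items()):
        if not subsumed_by(service["output"], wanted):
            continue
        missing = [
            needed
            for needed in service["inputs"]
            if not any(subsumed_by(p, needed) for p in provided)
        ]
        result.append((name, missing))
    return result
```

Here `candidates(["Woman"], "Publication")` selects `books_by_author`, since `Book` is a `Publication` and `Woman` satisfies the `Person` input, and reports `Year` as a missing input parameter.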
Differences with Classical Database Querying
Three main differences:
Information can be queried only through views (Local As View).
Nested types.
Incomplete information.
Three sources of complexity!
Current direction of work: using Magic sets techniques (for evaluation of Datalog programs) restricted to appropriate binding patterns.
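The binding-pattern restriction can be illustrated independently of Magic sets: a service may only be called once all of its input variables are bound. The sketch below is not the Magic sets transformation, only a greedy ordering under that constraint; the function `executable_order` and the example services are hypothetical.

```python
def executable_order(services, initially_bound):
    """Greedily order service calls so that each call has all its inputs bound.

    `services` maps a name to (input_variables, output_variables).
    Returns (call_order, names_of_services_that_can_never_be_called).
    """
    bound = set(initially_bound)
    remaining = dict(services)
    order = []
    progress = True
    while remaining and progress:
        progress = False
        for name in sorted(remaining):
            inputs, outputs = remaining[name]
            if set(inputs) <= bound:  # binding pattern satisfied
                order.append(name)
                bound |= set(outputs)  # outputs become bound for later calls
                del remaining[name]
                progress = True
                break
    return order, sorted(remaining)


# Hypothetical services over variables T (title), P (publication), A (author),
# N (name), I (ISBN).
EXAMPLE_SERVICES = {
    "title_to_pub": (["T"], ["P"]),
    "pub_to_author": (["P"], ["A"]),
    "author_to_name": (["A"], ["N"]),
    "isbn_to_pub": (["I"], ["P"]),
}
```

Starting from a bound title, `executable_order(EXAMPLE_SERVICES, {"T"})` chains the first three services and leaves `isbn_to_pub` uncallable, because nothing ever binds `I`.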
Web Service Semantic Interpretation Process
[Figure: pipeline diagram. Nodes: WWW, HTML form, Analyzed form + result pages, Web service, Analyzed Web service, Service index, User. Edge labels: discovery (twice), probing, wrapper induction, semantic analysis, indexing, query, results.]
Perspectives
Still a lot to do... In particular:
Answering queries using views on the semantic model.
Continue work on automatic wrapper induction, to get a form fully wrapped as a Web service.
Relation between schema mapping induction and inductive logic programming.
Other Works
Data Warehousing: Extraction of information from the Web, mailing lists... to build a warehouse of sociological data (with various people).
Graph, Text and Web Mining:
Similarity between nodes in graphs; application to synonym extraction (with Vincent Blondel, from UCL).
Related nodes in a graph; application to Wikipedia (with Yann Ollivier, from ÉNS Lyon).
PageRank prediction (with Michalis Vazirgiannis).
Machine Translation: Close relations with SYSTRAN; XML document processing, statistical and rule-based machine translation, multilingual authoring...