Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Understanding the Hidden Web

Pierre Senellart

Max-Planck-Institut für Informatik, 22 August 2007

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 1 / 32

The Hidden Web

Definition (Hidden Web, Deep Web, Invisible Web): The part of Web content not accessible from the hyperlinked structure of the World Wide Web. Typically: HTML forms, Web Services.

Size estimate (2001): 500 times larger than the surface Web.

How to understand it and benefit from its content?

Understanding the Hidden Web

Purpose: Intensional indexing of the Hidden Web.

High-level queries.

⇒ a semantic search engine over the Hidden Web.

In a fully automatic, unsupervised way!

Difficult and broad problem.

Use of domain knowledge (ontology, instances).

Example of the database publication domain.

Web Service Semantic Interpretation Process

[Figure: processing pipeline. Nodes: WWW, HTML form, Analyzed form + result pages, Web service, Analyzed Web service, Service index, User. Edge labels: discovery (twice), probing, wrapper induction, semantic analysis, indexing, query, results.]

Imprecise Data and Imprecise Tasks

Observations: Many needed tasks generate imprecise data, with some confidence value.

Need for a way to manage this imprecision, to work with it throughout an entire complex process.

A Probabilistic XML Warehouse

[Figure: Module 1, Module 2 and Module 3 access a Probabilistic XML Warehouse through an update interface and a query interface. Labels: Update transaction + confidence; Query; Results + confidence.]

A Probabilistic XML Warehouse (Hidden Web)

[Figure: the same architecture with the modules named Topic crawler, Form analyzer and Inf. Extractor. Labels shown on successive overlays: "Crawled URLs + confidence"; "Form URL?"; "URLs + confidence"; "Analyzed form + confidence"; "Form?"; "Person → [unreadable]"; "Person → ISBN + confidence".]

Outline

1 Introduction

2 A Probabilistic XML Data Model

3 Probing the Hidden Web

4 Wrapper Induction from Result Pages

5 Deriving Schema Mappings from Database Instances

6 Semantic Model

7 Conclusion

Joint work with Serge Abiteboul.

Probabilistic Trees

Framework: Unordered data trees. Details: no attributes, no mixed content...

[Figure: two data trees, one with nodes A, B, C, D and one with nodes A, B, B, C, D, shown as different (≠): multiset semantics.]

Sample space: Set of all such data trees.

Probabilistic tree (prob-tree): Representation of a discrete probability distribution over this sample space.
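Unordered trees with multiset semantics can be compared through a canonical form. The sketch below is only an illustration: the `(label, children)` encoding and the function name `canonical` are assumptions, and placing D under C is a guess, since the figure layout is lost in this transcript.

```python
from collections import Counter

def canonical(tree):
    """Canonical form of an unordered tree given as (label, [children]).

    Children are stored as a multiset: order is ignored, duplicates count.
    """
    label, children = tree
    multiset = Counter(canonical(child) for child in children)
    return (label, frozenset(multiset.items()))

# Hypothetical encodings of the two trees (D under C is an assumption).
t1 = ("A", [("B", []), ("C", [("D", [])])])
t2 = ("A", [("B", []), ("B", []), ("C", [("D", [])])])
t3 = ("A", [("C", [("D", [])]), ("B", [])])  # t1 with children reordered
```

Here `canonical(t1) == canonical(t3)` holds because order does not matter, while `canonical(t1) != canonical(t2)` because the second B counts.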

The Prob-Tree Model

Data tree with event conditions (conjunctions of probabilistic events or negations of probabilistic events) assigned to each node.

Probabilistic events are Boolean random variables, assumed to be independent, with their own probability distribution.

[Figure: prob-tree with nodes A, B (condition: w1, ¬w2), C, and D (condition: w2).]

Event  Prob.
w1     0.8
w2     0.7
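To make the model concrete, the following sketch enumerates every truth assignment of the events and sums the probability of each resulting data tree. It is a brute-force illustration, not the talk's implementation: the tuple encoding and function names are assumptions, and D is assumed to be a child of C because the figure layout is not recoverable here.

```python
from itertools import product
from collections import defaultdict

# Event probabilities from the table above.
EVENTS = {"w1": 0.8, "w2": 0.7}

# A node is (label, condition, children). A condition maps an event name to
# the truth value it must take. Placing D under C is an assumption.
PROB_TREE = ("A", {}, [
    ("B", {"w1": True, "w2": False}, []),
    ("C", {}, [("D", {"w2": True}, [])]),
])

def world(node, assignment):
    """Return the data tree selected by one truth assignment, or None."""
    label, condition, children = node
    if any(assignment[event] != value for event, value in condition.items()):
        return None  # the node and its whole subtree disappear
    kept = [w for w in (world(c, assignment) for c in children) if w is not None]
    # Sorting the children gives a canonical form for unordered trees.
    return (label, tuple(sorted(kept)))

def possible_worlds(tree, events):
    """Map each possible data tree to its total probability."""
    names = sorted(events)
    dist = defaultdict(float)
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        p = 1.0
        for name in names:  # events are independent, so probabilities multiply
            p *= events[name] if assignment[name] else 1.0 - events[name]
        dist[world(tree, assignment)] += p
    return dict(dist)

WORLDS = possible_worlds(PROB_TREE, EVENTS)
```

Under these assumptions `WORLDS` holds three trees: A(C(D)) with probability about 0.70, A(B, C) with about 0.24, and A(C) with about 0.06. The enumeration is exponential in the number of events, so it only serves to illustrate the semantics.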

Features of the Prob-Tree Model

Well-defined possible world semantics.

Full expressive power, reasonable conciseness.

Queries and updates can be applied directly on prob-trees, in an efficient way.

Complexity study.

Implementation available.

Joint work with Avin Mittal (IIT Bombay).

Analyzing HTML Forms

Analyzing the structure of HTML forms.

Problem: Associate each relevant form field with its corresponding domain concept.

First Step: Structural Analysis

1 Build a context for each field:
    label tag;
    id and name attributes;
    text immediately before the field.

2 Remove stop words, stem.

3 Match this context with the concept names, extended with WordNet.

4 Obtain in this way candidate annotations.
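The four steps can be sketched as follows. Everything named here is hypothetical: the stop-word list, the concept table (hand-written synonyms standing in for WordNet), the crude suffix-stripping stemmer, and the dictionary encoding of a form field are assumptions made for illustration.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "your", "enter", "please", "by"}

# Hypothetical concept names; the synonym sets stand in for WordNet.
CONCEPTS = {
    "author": {"author", "writer"},
    "title": {"title", "name"},
    "year": {"year", "date"},
}

def stem(word):
    """Very crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def context_terms(field):
    """Steps 1 and 2: gather label, id, name and preceding text, then clean."""
    raw = " ".join(field.get(k, "") for k in ("label", "id", "name", "text_before"))
    raw = re.sub(r"([a-z])([A-Z])", r"\1 \2", raw)  # split camelCase identifiers
    words = re.split(r"[^A-Za-z]+", raw.lower())
    return {stem(w) for w in words if w and w not in STOP_WORDS}

def candidate_annotations(field):
    """Steps 3 and 4: concepts whose (extended) names occur in the context."""
    terms = context_terms(field)
    return sorted(concept for concept, names in CONCEPTS.items()
                  if terms & {stem(n) for n in names})

FIELD = {"label": "Authors:", "id": "q_author",
         "name": "authorName", "text_before": "Search by"}
```

With this toy data, `candidate_annotations(FIELD)` yields both `author` and `title`, because `authorName` contributes the word "name". Such ambiguous candidates are what a later confirmation step has to sort out.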

Second Step: Confirm Annotations with Probing

For each field annotated with a concept c:

1 Probe the field with a nonsense word to get an error page.

2 Probe the field with instances of c (chosen representatively of the frequency distribution of c).

3 Compare pages obtained by probing with the error page (by using clustering along the DOM tree structure of the pages), to distinguish error pages from result pages.

4 Confirm the annotation if enough result pages are obtained.

In practice, very good precision and good recall; but some limitations on the kind of forms that can be dealt with.
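A minimal sketch of the probing loop is given below. It replaces the clustering of the slide with a simpler stand-in: a Jaccard similarity threshold on the sets of DOM tag paths. The `submit` callable (which sends one value through the form and returns the HTML of the response), the nonsense word, and both thresholds are assumptions.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag

class PathCollector(HTMLParser):
    """Collect the set of root-to-element tag paths of an HTML page."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def dom_paths(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.paths

def similarity(a, b):
    """Jaccard similarity of two path sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def confirm_annotation(submit, instances, nonsense="qzxvkjw",
                       same_page=0.8, min_result_ratio=0.5):
    """Steps 1 to 4, with arbitrary thresholds chosen for illustration."""
    error_paths = dom_paths(submit(nonsense))           # step 1
    result_pages = sum(                                 # steps 2 and 3
        similarity(dom_paths(submit(value)), error_paths) < same_page
        for value in instances
    )
    return result_pages / len(instances) >= min_result_ratio  # step 4
```

A page counts as a result page when its DOM paths differ enough from those of the error page. A real system would also need to handle sites whose error and result pages share the same template.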

P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 14 / 32

Page 47: Understanding the Hidden Web - Pierre Senellart · P. Senellart (INRIA & U. Paris-Sud) Understanding the Hidden Web MPI-Inf., 2007/08/22 3 / 32 Introduction Prob-Trees Probing Wrappers

Joint work with Avin Mittal (IIT Bombay).

Introduction Prob-Trees Probing Wrappers Schema Mappings Sem. Model Conclusion

Second Step: Confirm Annotations with Probing

For each field annotated with a concept c:

1 Probe the field with nonsense word to get an error page.2 Probe the field with instances of c (chosen representatively of the

frequency distribution of c).3 Compare pages obtained by probing with the error page (by using

clustering along the DOM tree structure of the pages), todistinguish error pages and result pages.

4 Confirm the annotation if enough result pages are obtained.

In practice, very good precision and good recall; but some limitationson the kind of forms that can be dealt with.



Joint work with researchers from mostrare (INRIA Futurs).


Query-answer Web Pages

Extract data from query-answer Web pages.

Issues

What part of the Web page contains the answer?

How to extract structured content?


Automatic Wrapper Induction with Domain Knowledge

Annotate pages with domain knowledge (finite-automata techniques): the annotation is both imperfect and incomplete.

Use machine learning to generalize the result into a structural extraction wrapper (Conditional Random Fields).
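
The first step, annotating a result page with domain knowledge, might look like the sketch below. The recognizers are hypothetical stand-ins for the domain knowledge (regular expressions, which compile to finite automata), and the sample names are taken only for illustration. The learning step with Conditional Random Fields is not shown; the noisy annotations produced here would serve as its training input.

```python
import re

# Hypothetical domain knowledge: one recognizer per concept.
RECOGNIZERS = {
    "Year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "Person": re.compile(r"Serge Abiteboul|Georg Gottlob"),
}


def annotate(text: str) -> list[tuple[str, str]]:
    """Return (matched text, concept) pairs in order of appearance."""
    hits = []
    for concept, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(), concept))
    hits.sort()  # order by position in the text
    return [(matched, concept) for _, matched, concept in hits]
```

An author name absent from the gazetteer is simply missed, which illustrates why the annotation is incomplete and why a generalization step is needed.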



Joint work with Georg Gottlob (Oxford University).


Motivation

Analyzing the relations between different sources, or between a source and the domain knowledge.

Problem

Given two database instances I and J with different schemata, what is the optimal description $\Sigma$ of J with respect to I (with $\Sigma$ a finite set of formulæ in some logical language)?

What does optimal imply?

Conciseness of description.

Validity of facts predicted by I and $\Sigma$.

Facts of J explained by I and $\Sigma$.

(Note the asymmetry between I and J; context of data exchange where J is computed from I and $\Sigma$.)


Example (Tuple-Generating Dependencies)

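$R = \{a, b, c, d\}$

$R' = \{(a, a), (b, b), (c, a), (d, d), (g, h)\}$

$\Sigma_0 = \emptyset$

$\Sigma_1 = \{\forall x\; R(x) \to R'(x, x)\}$

$\Sigma_2 = \{\forall x\; R(x) \to \exists y\; R'(x, y)\}$

$\Sigma_3 = \{\forall x \forall y\; R(x) \wedge R(y) \to R'(x, y)\}$

$\Sigma_4 = \{\exists x \exists y\; R'(x, y)\}$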
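
To see how the candidate mappings differ, the sketch below applies two of them to the example instances and counts the predicted facts that are not in R′ (invalid) and the facts of R′ that are not predicted (unexplained). It is a simplified reading of the validity and explanation criteria, not the formal definitions. Only the first and third mappings (the rules R(x) → R′(x, x) and R(x) ∧ R(y) → R′(x, y)) are covered, because they have no existential quantifier and therefore predict concrete facts; the function names are illustrative.

```python
from itertools import product

# The two instances of the example.
R = {"a", "b", "c", "d"}
R_PRIME = {("a", "a"), ("b", "b"), ("c", "a"), ("d", "d"), ("g", "h")}


def sigma1(r: set[str]) -> set[tuple[str, str]]:
    """Facts predicted by: forall x, R(x) -> R'(x, x)."""
    return {(x, x) for x in r}


def sigma3(r: set[str]) -> set[tuple[str, str]]:
    """Facts predicted by: forall x, y, R(x) and R(y) -> R'(x, y)."""
    return set(product(r, repeat=2))


def assess(predicted, target):
    """Return (invalid predictions, unexplained target facts)."""
    invalid = predicted - target
    unexplained = target - predicted
    return invalid, unexplained
```

The first mapping makes a single invalid prediction, (c, c), and leaves (c, a) and (g, h) unexplained. The third explains everything except (g, h) but at the price of twelve invalid predictions, which illustrates the tension between validity and explanation.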


Results

Description based on the minimum length of a repair of a formula that is valid and explains all facts of J.

This optimality notion gives “intuitive” results for instances derived from each other with simple operations.

Detailed complexity analysis for various languages and decision problems. Quite high in the polynomial hierarchy (up to the fourth level for general tgds!).

Even for $\forall x_1 \forall x_2 \forall x_3\; R(x_1, x_2, x_3) \to R'(x_1)$, computing the size of the minimal perfect repair is already NP-complete.



Joint work with Serge Abiteboul.


Conceptual Model

IsA ontology of concepts (simple DAG)

Thing
  Person
    Man
    Woman
  Publication
    Proceedings
    Article
    Book

n-ary typed roles

AuthorOf(Publication, Person)

HasName(Person, Name)
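
A small Python sketch of this conceptual model is given below. The parent links are an assumption read off the slide's diagram (Person and Publication under Thing, and so on), and the helper names `is_a` and `well_typed` are illustrative. Storing a set of parents per concept keeps the structure a DAG rather than a tree.

```python
# Direct IsA edges: concept -> set of parent concepts (assumed hierarchy).
IS_A = {
    "Person": {"Thing"},
    "Publication": {"Thing"},
    "Man": {"Person"},
    "Woman": {"Person"},
    "Proceedings": {"Publication"},
    "Article": {"Publication"},
    "Book": {"Publication"},
}

# Typed roles: role name -> tuple of argument concepts.
ROLES = {
    "AuthorOf": ("Publication", "Person"),
    "HasName": ("Person", "Name"),
}


def is_a(concept: str, ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or reaches it through IsA edges."""
    if concept == ancestor:
        return True
    return any(is_a(parent, ancestor) for parent in IS_A.get(concept, ()))


def well_typed(role: str, arg_concepts: tuple[str, ...]) -> bool:
    """Check that each argument concept is subsumed by the role's signature."""
    signature = ROLES[role]
    return len(signature) == len(arg_concepts) and all(
        is_a(arg, expected) for arg, expected in zip(arg_concepts, signature)
    )
```

For instance, a Book written by a Woman fits the AuthorOf role, while swapping the two arguments does not.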


Semantic Representation of a Service

What is a service described by?

An n-tuple of typed input parameters.

A complex (= nested) type of its output.

Semantic relations between inputs and outputs (Datalog-like description).

Definition (Complex types)

S: set of concepts

$T ::= S \mid \langle T, \ldots, T \rangle \mid T^{*}$
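
The complex types can be modelled as a small recursive data structure, as in the sketch below. It assumes that the last alternative of the definition is a Kleene star (a list of values), which matches types such as `A*` used in the service examples. The class names and the encoding of values (strings for concept instances, Python tuples and lists for nesting) are illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """An atomic type: a concept from the set S."""
    name: str


@dataclass(frozen=True)
class TupleType:
    """A tuple type <T, ..., T>."""
    components: tuple


@dataclass(frozen=True)
class Star:
    """A list type T* (assumed reading of the last alternative)."""
    item: object


def show(t) -> str:
    """Render a type in the notation of the slides."""
    if isinstance(t, Concept):
        return t.name
    if isinstance(t, TupleType):
        return "<" + ",".join(show(c) for c in t.components) + ">"
    if isinstance(t, Star):
        return show(t.item) + "*"
    raise TypeError(f"not a complex type: {t!r}")


def conforms(value, t) -> bool:
    """Check the shape of a nested value against a complex type."""
    if isinstance(t, Concept):
        return isinstance(value, str)
    if isinstance(t, TupleType):
        return (
            isinstance(value, tuple)
            and len(value) == len(t.components)
            and all(conforms(v, c) for v, c in zip(value, t.components))
        )
    if isinstance(t, Star):
        return isinstance(value, list) and all(conforms(v, t.item) for v in value)
    raise TypeError(f"not a complex type: {t!r}")
```

The shape check ignores which concept a string belongs to; a fuller version would consult the ontology.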


Services and Queries

Example

Service giving authors from publication titles:

A* ← AuthorOf(A,P), HasTitle(P,T), Input(T)

Example

Query:

<A,T*>* ← AuthorOf(A,P), Article(P), HasTitle(P,T), KeywordOf(“xml”,P)
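
The two rules can be read as joins over the role relations. The sketch below evaluates them over a tiny, entirely synthetic fact base (all identifiers such as `author1` and `pub1` are made up for illustration), following the argument order used in the rules, with the author first in AuthorOf.

```python
# Synthetic facts, for illustration only.
AUTHOR_OF = {
    ("author1", "pub1"), ("author2", "pub1"),
    ("author1", "pub2"), ("author3", "pub3"),
}
HAS_TITLE = {("pub1", "Title One"), ("pub2", "Title Two"), ("pub3", "Title Three")}
ARTICLE = {"pub1", "pub2"}
KEYWORD_OF = {("xml", "pub1"), ("xml", "pub3")}


def authors_by_title(title: str) -> list[str]:
    """Service of type A*: authors of the publications with the input title."""
    return sorted({
        a
        for (a, p) in AUTHOR_OF
        for (q, t) in HAS_TITLE
        if p == q and t == title
    })


def xml_articles_by_author() -> list[tuple[str, list[str]]]:
    """Query of type <A,T*>*: per author, titles of articles with keyword xml."""
    titles: dict[str, set[str]] = {}
    for a, p in AUTHOR_OF:
        if p in ARTICLE and ("xml", p) in KEYWORD_OF:
            for q, t in HAS_TITLE:
                if q == p:
                    titles.setdefault(a, set()).add(t)
    return [(a, sorted(ts)) for a, ts in sorted(titles.items())]
```

In the synthetic data, `pub3` carries the keyword but is not an article, so its author does not appear in the query result.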


Managing Extensional Information

How to represent extensional information (i.e., documents) in this formalism?

Definition

A document is a service with no input.

Complex types: natural representation of a DTD.

(Disjunctions a|b simulated by (a?,b?)).


Web Service Indexing and Querying

Given a query, represented as an analyzed Web service, how to know which known Web services to query?

Issues

Subsumption of input/output parameters.

Missing input parameters.

Composition of Web Services.


Differences with Classical Database Querying

Three main differences:

Information can be queried only through views (Local As View).

Nested types.

Incomplete information.

Three sources of complexity!

Current direction of work: using magic-sets techniques (for the evaluation of Datalog programs) restricted to appropriate binding patterns.



Web Service Semantic Interpretation Process

[Figure: processing pipeline. Nodes: WWW, HTML form, Analyzed form + result pages, Web service, Analyzed Web service, Service index, User. Edge labels: discovery (twice), probing, wrapper induction, semantic analysis, indexing, query, results.]


Perspectives

Still a lot to do... In particular:

Answering queries using views on the semantic model.

Continue work on automatic wrapper induction, to get a form fully wrapped as a Web service.

Relation between schema mapping induction and inductive logic programming.


Other Works

Data Warehousing

Extraction of information from the Web, mailing lists... to build a warehouse of sociological data (with various people).

Graph, Text and Web Mining

Similarity between nodes in graphs; application to synonym extraction (with Vincent Blondel, from UCL).

Related nodes in a graph; application to Wikipedia (with Yann Ollivier, from ÉNS Lyon).

PageRank prediction (with Michalis Vazirgiannis).

Machine Translation

Close relations with SYSTRAN; XML document processing, statistical and rule-based machine translation, multilingual authoring...
