
Web Monitoring 2002

Issues in Monitoring Web Data

Serge Abiteboul

INRIA and Xyleme

Serge.Abiteboul@inria.fr

Organization

1. Introduction
  – What is there to monitor?
  – Why monitor?
2. Some applications of web monitoring
3. Web archiving
  – An experience: the archiving of the French web
  – Page importance and change frequency
4. Creation of a warehouse using web resources
  – An experience: the Xyleme Project
  – Monitoring in Xyleme
5. Queries and monitoring
6. Conclusion


1. Introduction

The Web Today

• Billions of pages + millions of servers
• Query = keywords to retrieve URLs
  – Imprecise; query results are useless for further processing
• Applications: based on ad-hoc wrapping
  – Expensive; incomplete; short-lived; not adapted to the Web's constant changes
• Poor quality
  – Cannot be trusted: spamming, rumors…
  – Often stale
  – Our vision of it is often out of date
• Importance of monitoring


The HTML Web Structure

Source : IBM, AltaVista, Compaq

HTML: Percentage covered by Crawlers

Source: searchenginewatch.com


So much for the world knowledge…

• Most of the web is not reached by crawlers (hidden web)

• Some of the public HTML pages are never read

• Most of what is on the web is junk anyway

• Our knowledge of it may be stale

• Do not junk the techno – improve it!

What is there to monitor?

• Documents: HTML but also doc, pdf, ps…
• Many data exchange formats such as asn1, bibtex…
• New official data exchange format: XML
• Hidden web: database queries behind forms or scripts
• Multimedia data: ignored here
• Public vs. private (Intranet or Internet+passwd)
• Static vs. dynamic

What is changing?

• XML is coming
  – Universal data exchange format
  – Marriage of document and database worlds
  – Standard query language: XQuery
  – Quickly growing on Intranet and very slowly on public web (less than 1%)
• Web services are coming
  – Format for exporting services
  – Format for encapsulating queries
• More semantics to be expected
  – RDF for data
  – WSDL+UDDI for services

What is not changing fast or even getting worse

• Massive quantity of data – most of it junk
• Lots of stale data
• Very primitive HTML query mechanisms (keywords)
• No real change control mechanism soon
  – Compare: database queries (fresh data) with web search engines (possibly stale)
  – Compare: database triggers (based on push) with web notification services (most of the time based on pull/refresh)

The need to monitor the web

• The web changes all the time
• Users are often as interested in changes as in the data itself: new products, new press articles, new prices…
• Discover new resources
• Keep our vision of the web up-to-date
• Be aware of changes that may be of interest or have an impact on our business

Analogy: databases

• Databases
  – Query: instantaneous vision of data
  – Trigger: alert/notification of some changes of interest
• Web
  – Query: need monitoring to give correct answer
  – Monitoring: to support alert/notifications of changes of interest

Web vs. database monitoring

• Quantity of data: larger on the web
• Knowledge of data
  – Structure and semantics known in databases
• Reliability and availability
  – High in databases; null on the web
• Data granularity
  – Tuple vs. page in HTML or element in XML
• Change control
  – Databases: support from data sources/triggers
  – Web: no support; pull only in general
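The "pull only" situation on the web can be illustrated with a minimal sketch: with no trigger support from the source, a monitor has to re-read each page and compare it with what it saw last time. The class name `PullMonitor`, the `fetch` callable and the example URL are assumptions made for this illustration; they are not part of Xyleme.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(content: bytes) -> str:
    """Return a stable digest of a page body."""
    return hashlib.sha256(content).hexdigest()


class PullMonitor:
    """Detects changes by re-reading pages and comparing digests (pull model)."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self.fetch = fetch                  # hypothetical page reader, e.g. an HTTP GET
        self.seen: Dict[str, str] = {}      # URL -> digest observed at the last visit

    def poll(self, urls: List[str]) -> Dict[str, str]:
        """Visit each URL once and report 'new', 'changed' or 'unchanged'."""
        report: Dict[str, str] = {}
        for url in urls:
            digest = fingerprint(self.fetch(url))
            previous = self.seen.get(url)
            if previous is None:
                report[url] = "new"
            elif previous != digest:
                report[url] = "changed"
            else:
                report[url] = "unchanged"
            self.seen[url] = digest
        return report


# Usage with an in-memory stand-in for the web (no network access needed).
pages = {"http://example.org/a": b"price: 10"}
monitor = PullMonitor(lambda url: pages[url])
first = monitor.poll(list(pages))     # the page is seen for the first time
pages["http://example.org/a"] = b"price: 12"
second = monitor.poll(list(pages))    # the digest differs from the stored one
```

A change is only noticed at the next poll, which is why the refresh frequency matters so much in the pull model.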

2. Some applications of web monitoring

Comparative shopping

• Unique entry point to many catalogs
• Data integration problem
• Main issue: wrapping of web catalogs
  – Semi-automatic so limited to a few sites
  – Simpler and towards automatic with XML
• Alternatives
  – Mediation when data change very fast
    • prices and availability of plane tickets
  – Warehousing otherwise, which requires monitoring changes

Web surveillance

• Applications
  – Anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisition of chemical products
  – Business intelligence, e.g., discovering potential customers, partners, competitors
• Find the data (crawl the web)
• Monitor the changes
  – new pages, deleted pages, changes in a page
• Classify information and extract data of interest
  – Data mining, text understanding, knowledge representation and extraction, linguistic… Very AI

Copy tracking

• Example: a press agency wants to check that people are not publishing copies of their wires without paying

[Figure: pipeline with three numbered stages. Labels: slice the document; query to search engine, or specific crawl + pre-filter; flow of candidate documents; filter/detection]
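One way to picture the "slice the document" and "filter" steps is the sketch below. It cuts a wire into overlapping runs of words and measures how many of them reappear in a candidate page. This word-slice containment test is a common technique chosen here for illustration; the slides do not say which detection method was used, and the function names, slice size and threshold are assumptions.

```python
from typing import List, Set, Tuple


def slices(text: str, size: int = 5) -> Set[Tuple[str, ...]]:
    """Cut a text into overlapping runs of `size` consecutive words."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(wire: str, candidate: str, size: int = 5) -> float:
    """Fraction of the wire's slices that also occur in the candidate."""
    wire_slices = slices(wire, size)
    if not wire_slices:
        return 0.0
    return len(wire_slices & slices(candidate, size)) / len(wire_slices)


def likely_copies(wire: str, candidates: List[str],
                  threshold: float = 0.5, size: int = 5) -> List[str]:
    """Keep the candidate documents that reuse a large part of the wire."""
    return [c for c in candidates if containment(wire, c, size) >= threshold]


# Usage: one candidate embeds the wire, the other is unrelated.
wire = "the minister announced a new budget for public research today"
copy = "breaking news: " + wire + " read more"
other = "a completely unrelated page about cooking pasta at home tonight"
suspects = likely_copies(wire, [copy, other])
```

In a real pipeline the candidates would come from a search engine query or a dedicated crawl, as the slide indicates; here they are plain strings.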


Web archiving

• We will discuss an experience in archiving the French web

Creation of a data warehouse with resources found on the web

• We will discuss some work in the Xyleme project on the construction of XML warehouses

3. Web archiving

An experience towards the archiving of the French web with Bibliothèque Nationale de France

Dépôt légal (legal deposit)

• Books have been archived since 1537, following a decision by King François I
• The Web is an important and valuable source of information that should also be archived
• What is different?
  – Number of content providers: 148,000 sites vs. 5,000 editors
  – Quantity of information: millions of pages + video/audio
  – Quality of information: lots of junk
  – Relationship with editors: freedom of publication vs. traditional ‘push’ model
  – Updates and changes occur continuously
  – The perimeter is unclear: what is the French web?


Goal and Scope

• Provide future generations with a representative archive of the cultural production

• Provide material for cultural, political, sociological studies

• The mission is to archive a wide range of material because nobody knows what will be of interest for future research

• In traditional publication, publishers filter content. There is no such filter on the web

Similar Projects

• The Internet Archive www.archive.org
  – The Wayback machine
  – Largest collection of versions of web pages
• Human selection based approach
  – select a few hundred sites and choose a periodicity of archiving
  – Australia and Canada
• The Nordic experience
  – Use robot crawler to archive a significant part of the surface web
  – Sweden, Finland, Norway
• Problems encountered:
  – Lack of updates of archived pages between two snapshots
  – The hidden Web

Orientation of our experiment

• Goals:
  – Cover a large portion of the French web
    • Automatic content gathering is necessary
  – Adapt robots to provide a continuous archiving facility
    • Have frequent versions of the sites, at least for the most “important” ones
• Issues:
  – The notion of “important” sites
  – Building a coherent Web archive
  – Discover and manage important sources of deep Web

First issue: the perimeter

• The perimeter of the French Web: contents edited in France
• Many criteria may be used:
  – The French language, but many French sites use English (e.g. INRIA) + many French-speaking sites are from other French-speaking countries or regions (e.g. Quebec)
  – Domain name or resource locators: .fr sites, but many are also in .com or .org
  – Address of the site: physical location of the web servers or address of the owner
• Other criteria than the perimeter
  – Little interest in commercial sites
  – Possibly interest in foreign sites that discuss French issues
• A purely automatic approach does not work: involve librarians

Second issue: Site vs. Page archiving

• The Web:
  – Physical granularity = HTML pages
  – The problem is inconsistent data and links
    • Read page P; one week later read pages pointed to by P – may not exist anymore
  – Logical granularity?
• Snapshot view of a web site
  – What is a site?
    • INRIA is www.inria.fr + www-rocq.inria.fr…
    • www.multimania.com is the provider of many sites
  – There are technical issues (rapid firing, …)


Importance of data

What is page importance?

• “Le Louvre” homepage is more important than an unknown person’s homepage
• Important pages are pointed to by:
  – Other important pages
  – Many unimportant pages
• This leads to Google's definition of PageRank
  – Based on the link structure of the web
  – Used with remarkable success by Google for ranking results
• Useful but not sufficient for web archiving

Page Importance

• Importance
  – Link matrix L
  – In short, page importance is the fixpoint X of the equation L*X = X
  – Storing the Link matrix and computing page importance uses lots of resources
• We developed a new efficient technique to compute the fixpoint
  – Without having to store the Link matrix
  – Technique adapts automatically to changes
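To make the fixpoint L*X = X concrete, here is a sketch of the textbook power iteration on a tiny graph. This is the standard baseline, not the new technique mentioned on the slide (whose details are not given here), and it does keep the whole link structure in memory. The three-page graph and the function name are assumptions for illustration.

```python
from typing import Dict, List


def page_importance(links: Dict[str, List[str]],
                    iterations: int = 100) -> Dict[str, float]:
    """Approximate the fixpoint X of L*X = X by repeated multiplication.

    `links[p]` lists the pages that p points to. Each page splits its
    importance equally among its outgoing links. The sketch assumes every
    page has at least one outgoing link and that the graph is strongly
    connected and aperiodic, so the iteration converges.
    """
    pages = list(links)
    x = {p: 1.0 / len(pages) for p in pages}       # start from a uniform vector
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for p, targets in links.items():
            share = x[p] / len(targets)            # importance sent along each link
            for t in targets:
                nxt[t] += share
        x = nxt
    return x


# Usage on a hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
importance = page_importance(web)
```

On this graph the fixpoint can be checked by hand: a and c receive the same importance and b receives half of it, so the values approach 0.4, 0.2 and 0.4.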

Site vs. pages

• Limitation of page importance
  – Google page importance works well when links have strong semantics
  – More and more web pages are automatically generated and most links have little semantics
• More limitations
  – Refresh at the page level presents drawbacks
• So we also use link topology between sites and not only between pages

Experiments

• Crawl
  – We used between 2 and 8 PCs for Xyleme crawlers for 2 months
  – Discovery and refresh based on page importance
• Discovery
  – We looked at more than 1.5 billion (most interesting) web pages
  – We discovered more than 15 million *.fr pages – about 1.5% of the web
  – We discovered 150 000 *.fr sites
• Refresh
  – Important pages were refreshed more often
  – The change rate of pages is also taken into account
• Analysis of the relevance of site importance for librarians
  – Comparison with ranking by librarians
  – Strong correlation with their rankings

Issues and ongoing work: Other criteria for importance

• Take into account indications by archivists
  – They know best -- man-machine-interface issue
• Use classification and clustering techniques to refine the notion of site
• Frequent use of infrequent words
  – Find pages dedicated to specific topics
• Text Weight
  – Find pages with text content (vs. raw data pages)
• Others


4. Creation of a Warehouse from Web data

The Xyleme Project

Xyleme in short

• The Xyleme project
  – Initiated at INRIA
  – Joint work with researchers from Orsay, Mannheim and CNAM-Paris universities
• The Xyleme company: www.xyleme.com
  – Started in 2000
  – About 30 people
  – Mission: Deliver a new generation of content technologies to unlock the potential of XML
• Here: focus on the Xyleme project

Goal of the Xyleme project

• Focus is on XML data (but also handle HTML)
• Semantic
  – Understand tags, partition the Web into semantic domains, provide a simple view of each domain
• Dynamicity
  – Find and monitor relevant data on the web
  – Control relevant changes in Web data
• XML storage, index and queries
  – Manage efficiently millions of XML documents and process millions of simultaneous queries

Corporate information environment with Xyleme

[Figure: architecture diagram. Components: Web; Information System; Xyleme Server; Repository; Query Engine; XML Repository. Arrow labels: crawling & interpreting data; publishing; systematic updating; queries; searches]

XML in short

• Data exchange format
• eXtensible Mark-up Language (child of SGML)
• Promoted by W3C and major industry players
• XML document: ordered labeled tree
• Other essential gadgets: unicode, namespaces, attributes, pointers, typing (XML schema)…

XML magic in short

• Presentation is given elsewhere (style-sheet)
• Semantics and structure are provided by labels
• So it is easy to extract information
• Universal format understood by more and more software (e.g., exported by most databases + read by more and more editors)
• More and more tools available

It is easy to extract information

Information System

Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99

XML

<product reference="X23">
  <designation> camera </designation>
  <price unit="Dollars"> 359.99 </price>
  <description> … </description>
</product>

Ref     product/reference
Name    product/designation
Price   product/price
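The path mapping on this slide (Ref to product/reference, Name to product/designation, Price to product/price) can be tried with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the function name `extract_row` is an assumption, and the elided description is written as `...` only to keep the sample self-contained.

```python
import xml.etree.ElementTree as ET

PRODUCT_XML = """
<product reference="X23">
  <designation> camera </designation>
  <price unit="Dollars"> 359.99 </price>
  <description> ... </description>
</product>
"""


def extract_row(xml_text: str) -> dict:
    """Turn one <product> element into a (Ref, Name, Price) row."""
    product = ET.fromstring(xml_text)
    return {
        "Ref": product.get("reference"),                    # product/reference (attribute)
        "Name": product.findtext("designation").strip(),    # product/designation
        "Price": float(product.findtext("price")),          # product/price
    }


row = extract_row(PRODUCT_XML)
```

Because the labels carry the structure, no page-specific wrapper is needed; the same three paths work for every product element that follows this layout.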

4.1 Xyleme: Functionality and architecture

The goal of Xyleme project: XML Dynamic Datawarehouse

• Many research issues
  – Query Processor
  – Semantic Classification
  – Data Monitoring
  – Native Storage
  – XML document Versioning
  – XML automatic or user driven acquisition
  – Graphical User Interface through the Web

Functional Architecture

[Figure: layered diagram. Modules: User Interface; Xyleme Interface; Query Processor; Semantic Module; Change Control; Repository and Index Manager; Loader; Acquisition & Crawler; Web Interface; Internet]

Architecture

[Figure: cluster of machines connected by Ethernet. Labels: Interface (Change | Semantic, Global Query), shown twice; Index, shown three times; Global Loader; Web Interface and Crawler, connected to the Internet; repositories (Loader | Query | Version, Repository), shown twice; XML document extents grouped by DTD (DTDi, DTDj; DTDk, DTDl; DTDm, ..; DTDp ...)]

Prototype main choices

• Network of Linux PCs
• C++ on the server side
• CORBA for communications between PCs
• HTTP + SOAP for external communications
  – Exception for query processing

Scaling

Parallelism based on
• Partitioning
  – XML documents
  – URL table
  – Indexes (semantic partitioning)
• Memory replication
• Autonomous machines (PCs)
  – Caches are used for data flow

4.2 Xyleme: Data Acquisition

Data Acquisition

• Xyleme crawler visits the HTML/XML web
• Management of metadata on pages
• Sophisticated strategy to optimize network bandwidth
  – importance ranking of pages
  – change frequency and age of pages
  – publications (owners) & subscriptions (users)
• Each crawler visits about 4 million pages per day
• Each indexer can index 1 million pages per day
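A refresh strategy that combines importance, change frequency and age can be sketched as a simple priority score. The formula below (importance times estimated change rate times age) is purely an assumption chosen for illustration; the slides do not give the actual Xyleme scoring function, and the `PageMeta` fields are hypothetical metadata.

```python
import heapq
from dataclasses import dataclass
from typing import List


@dataclass
class PageMeta:
    url: str
    importance: float     # e.g. a link-based score in [0, 1]
    change_rate: float    # estimated number of changes per day
    age_days: float       # days since the page was last read


def refresh_priority(page: PageMeta) -> float:
    """Hypothetical score: importance times the expected number of missed changes."""
    return page.importance * page.change_rate * page.age_days


def next_to_refresh(pages: List[PageMeta], budget: int) -> List[str]:
    """Pick the `budget` pages with the highest priority for the next crawl."""
    best = heapq.nlargest(budget, pages, key=refresh_priority)
    return [p.url for p in best]


# Usage with three made-up pages and a budget of two fetches.
known = [
    PageMeta("news", importance=0.9, change_rate=5.0, age_days=1.0),
    PageMeta("home", importance=0.2, change_rate=0.1, age_days=30.0),
    PageMeta("catalog", importance=0.5, change_rate=1.0, age_days=3.0),
]
to_fetch = next_to_refresh(known, budget=2)
```

With a limited bandwidth budget, a rarely changing page can wait a long time even if it is old, while an important, fast-changing page is revisited quickly.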

4.3 Xyleme: Change Control

Change Management

• Monitoring
  – subscriptions
  – continuous queries
  – versions
• The Web changes all the time
• Data acquisition
  – automatic and via publication

Subscription

• Users can subscribe to certain events, e.g.,
  – changes in all pages of a certain DTD or of a certain semantic domain
  – insertion of a new product in a particular catalog or in all catalogs with a particular DTD
• They may request to be notified
  – at the time the event is detected by Xyleme
  – regularly, e.g., once a week

Continuous Queries

• Queries asked regularly or when some events are detected
  – send me each Monday the list of movies in Pariscope
  – send me each Monday the list of new movies in Pariscope
  – each time you detect that a new member is added to the Stanford DB-group, send me their lists of publications from their homepages

Versions and Deltas

• Store snapshots of documents
• For some documents, store changes (deltas)
  – storage: last version + sequence of deltas
  – complete delta: reconstruct old versions
  – partial delta: allows sending changes to the user and allows refresh
  – Deltas are XML documents, so changes can be queried as standard data
• Temporal queries
  – List of products that were introduced in this catalog since January 1st 2002
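The "last version + sequence of deltas" storage can be sketched as follows. To stay short, a document is flattened into a dictionary from element identifier to content, and a delta is a plain dictionary rather than an XML document; both simplifications, and all names in the code, are assumptions for illustration and do not reflect the actual Xyleme delta format.

```python
from typing import Dict, List, Optional

Doc = Dict[str, str]                 # element id -> content (stand-in for an XML tree)
Delta = Dict[str, Optional[str]]     # element id -> older content, None if absent before


def backward_delta(new: Doc, old: Doc) -> Delta:
    """Record what must change in `new` to get `old` back."""
    delta: Delta = {}
    for key in new.keys() | old.keys():
        if new.get(key) != old.get(key):
            delta[key] = old.get(key)
    return delta


def apply_delta(doc: Doc, delta: Delta) -> Doc:
    """Return a copy of `doc` with the delta applied."""
    result = dict(doc)
    for key, value in delta.items():
        if value is None:
            result.pop(key, None)    # the element did not exist in the older version
        else:
            result[key] = value
    return result


class VersionStore:
    """Keeps the last version plus a sequence of backward deltas."""

    def __init__(self, first: Doc) -> None:
        self.last: Doc = dict(first)
        self.deltas: List[Delta] = []    # deltas[-1] leads from `last` to the version before it

    def commit(self, new: Doc) -> None:
        self.deltas.append(backward_delta(new, self.last))
        self.last = dict(new)

    def version(self, steps_back: int) -> Doc:
        """Reconstruct the version `steps_back` commits before the last one."""
        if not 0 <= steps_back <= len(self.deltas):
            raise ValueError("no such version")
        doc = dict(self.last)
        for delta in reversed(self.deltas[len(self.deltas) - steps_back:]):
            doc = apply_delta(doc, delta)
        return doc


# Usage: three versions of a small catalog.
v1 = {"p1": "camera"}
v2 = {"p1": "camera", "p2": "robot"}
v3 = {"p1": "camera 2", "p2": "robot"}
store = VersionStore(v1)
store.commit(v2)
store.commit(v3)
oldest = store.version(2)
```

Stable element identifiers are what make such deltas small; a temporal query such as "products introduced since a given date" then amounts to inspecting the deltas committed after that date.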

The Information Factory

[Figure: data flow over time. Labels: Web; loaders; changes detection; subscription processor; send notification; continuous queries; Repository; documents and deltas; version queries; results; time]

Results

• Very efficient XML Diff algorithm
  – compute difference between consecutive versions
• Representation of deltas based on an original naming scheme for XML elements
  – one element is assigned a unique identifier for its entire life
  – compact way of representing these IDs
• Efficient versioning mechanism

Results

• Sophisticated monitoring algorithm
  – Detection of simple patterns (conjunctions) at the document level
  – Detection of changes between consecutive versions of the same documents
• Scale to dozens of crawlers loading millions of documents per day for a single monitor
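Detection of conjunctions at the document level can be pictured with the sketch below: each subscription is a set of atomic events that must all hold for a document, and an inverted index from events to subscriptions avoids testing every subscription for every document. The event strings, the class name and the index layout are assumptions for illustration, not the actual Xyleme monitoring algorithm.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


class Monitor:
    """Matches documents against subscriptions expressed as conjunctions."""

    def __init__(self) -> None:
        self.subscriptions: Dict[str, FrozenSet[str]] = {}
        self.by_event: Dict[str, Set[str]] = {}    # atomic event -> subscription names

    def subscribe(self, name: str, events: Iterable[str]) -> None:
        conjunction = frozenset(events)
        self.subscriptions[name] = conjunction
        for event in conjunction:
            self.by_event.setdefault(event, set()).add(name)

    def notify(self, detected: Iterable[str]) -> List[str]:
        """Return the subscriptions whose events were all detected on one document."""
        detected_set = set(detected)
        candidates: Set[str] = set()
        for event in detected_set:
            candidates |= self.by_event.get(event, set())
        return sorted(n for n in candidates
                      if self.subscriptions[n] <= detected_set)


# Usage with hypothetical atomic events produced by the loaders.
monitor = Monitor()
monitor.subscribe("new-product-alert", ["dtd:catalog", "inserted:product"])
monitor.subscribe("any-catalog-change", ["dtd:catalog"])
monitor.subscribe("price-watch", ["dtd:catalog", "changed:price"])
matches = monitor.notify(["dtd:catalog", "inserted:product"])
```

Only subscriptions that share at least one event with the document are examined, which is what keeps the per-document cost low when the number of subscriptions is large.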

Issues: languages for monitoring

• In the spirit of temporal languages for relational databases
• But
  – Data model is richer (trees vs. tables)
  – Context is richer: versions, continuous queries, monitoring of data streams…

4.4 Xyleme: Semantic Data Integration

Data Integration

• One application domain -- Several schemas
  – heterogeneous vocabulary and structure
• Xyleme Semantic Integration
  – gives the illusion that the system maintains a homogeneous database for this domain
  – abstracts a set of DTDs into a hierarchy of pertinent terms for a particular domain (business, culture, tourism, biology, …)

Technology in short

• Cluster DTDs into application domains
• For an application domain – semi-automatically
  – Organize tags into a hierarchy of concepts using thesauri such as Wordnet and other linguistic tools
  – This provides the abstract DTD for the particular domain
  – Generate mappings between concrete DTDs and the abstract one

4.5 Xyleme: Query Processing

Xyleme Query Language

A mix of OQL and XQL; will use the W3C standard when there is one.

Select product/name, product/price
From doc in catalogue,
     product in doc/product
Where product//components contains "flash"
  and product/description contains "camera"
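To show what this query computes, here is a rough Python equivalent over a single small document, using the standard `xml.etree.ElementTree` module. The sample catalogue, the `parts` wrapper element and the reading of `contains` as a case-insensitive substring test are all assumptions for illustration; the real Xyleme engine evaluates such queries over many documents using indexes.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

CATALOGUE = """
<catalogue>
  <product>
    <name>Camera X23</name>
    <price>359.99</price>
    <description>Compact digital camera</description>
    <parts><components>lens, flash, battery</components></parts>
  </product>
  <product>
    <name>PC Z25</name>
    <price>1299.99</price>
    <description>Desktop computer</description>
    <parts><components>flash memory, fan</components></parts>
  </product>
</catalogue>
"""


def contains(element: ET.Element, word: str) -> bool:
    """True if the word occurs anywhere in the element's text content."""
    return word.lower() in "".join(element.itertext()).lower()


def run_query(doc: ET.Element) -> List[Tuple[str, str]]:
    """Select product/name, product/price for products matching both conditions."""
    results = []
    for product in doc.findall("product"):                       # product in doc/product
        has_flash = any(contains(c, "flash")
                        for c in product.findall(".//components"))   # product//components
        has_camera = any(contains(d, "camera")
                         for d in product.findall("description"))    # product/description
        if has_flash and has_camera:
            results.append((product.findtext("name"), product.findtext("price")))
    return results


answers = run_query(ET.fromstring(CATALOGUE))
```

The second product mentions "flash" in its components but not "camera" in its description, so only the first product is returned.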

Principle of Querying

Query on abstract DTD: translated into a union of concrete queries (possibly with joins)

Abstract path                    Concrete paths
catalogue/product/price          d1//camera/price, d2/product/cost
catalogue/product/description    d1//camera/description, d2/product/info, ref d2/description

Mappings between concrete and abstract DTDs
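The translation from an abstract path to a union of concrete paths can be sketched with a lookup table. The table below reuses the paths shown on the slide, leaving out the `ref d2/description` entry because its exact meaning is not recoverable from the transcript; the function names and the grouping by DTD are assumptions for illustration.

```python
from typing import Dict, List

# Abstract path -> concrete paths, taken from the slide.
MAPPINGS: Dict[str, List[str]] = {
    "catalogue/product/price": ["d1//camera/price", "d2/product/cost"],
    "catalogue/product/description": ["d1//camera/description", "d2/product/info"],
}


def translate(abstract_path: str) -> List[str]:
    """Return the concrete paths whose union answers the abstract path."""
    return MAPPINGS.get(abstract_path, [])


def concrete_sources(abstract_paths: List[str]) -> Dict[str, List[str]]:
    """Group the concrete paths by the DTD (d1, d2, ...) they belong to."""
    by_dtd: Dict[str, List[str]] = {}
    for path in abstract_paths:
        for concrete in translate(path):
            dtd = concrete.split("/", 1)[0]     # the prefix before the first slash names the DTD
            by_dtd.setdefault(dtd, []).append(concrete)
    return by_dtd


# Usage: which concrete paths does a query on price and description touch?
plan = concrete_sources(["catalogue/product/price",
                         "catalogue/product/description"])
```

Grouping by DTD hints at how a translated query could be routed to the machines that store documents of each DTD.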

Query Processing

1. Partial translation, from abstract to concrete, to identify “machines” with relevant data
2. Algebraic rewriting, linear search strategy based on simple heuristics: in priority, use in-memory indexes and minimize communication
3. Decomposition into local physical subplans and installation
4. Execution of plans
5. If needed, relaxation


Query processing

• Essential use of a smart index combining full-text and structure

4.6 Xyleme: Repository

Storage System

• Xyleme store
  – efficient storage of trees in variable length records within fixed length pages
• Balancing of tree branches in case of overflow
  – minimize the number of I/O for direct access and scanning
  – good compromise: compaction / access time


5. Conclusion

Web monitoring

• Very challenging problem
  – Complexity due to the volume of data and the number of users
  – Complexity due to heterogeneity
  – Complexity due to lack of cooperation from data sources
• Many issues to investigate

New directions

• Active web sites
  – Friendly sites willing to cooperate
  – Web services provide the infrastructure
  – Support for triggers
• Mobile data
  – Web sites on mobile devices
  – Issues of availability (device unplugged)
  – Issues in synchronization
  – Geography dependent queries