Xyleme, January 2001 -- Zurich
A Dynamic Warehouse for the XML data of the Web
Serge Abiteboul
INRIA & Xyleme SA
Serge.Abiteboul@inria.fr Serge.Abiteboul@xyleme.com
http://www-rocq.inria.fr/verso http://www.xyleme.com
Organization
• The Web and XML
• Xyleme
• 1. Data Acquisition and Maintenance
• 2. XML Repository
• 3. Semantic Data Integration
• 4. Query Processing
• 5. Query Subscription
• Conclusion
The Web and XML
The Web today
• Terabytes of data
• Private web: not publicly available pages
• Deep web: data hidden behind forms
• A lot of public pages
– 1 billion in [06/2000]
– several million servers
The Web today
• Browsing
• Search engines
– Google indexes more than 1 billion pages (11/00)
– in: list of words
– out: sorted list of URLs
• based on occurrence of words in documents
• based on the link structure of the web
The Web today
• Queries: keywords to retrieve URLs
– Imprecise
– Query results cannot be directly processed
– Difficult to extract data of interest
• Applications: based on hand-made wrappers
– Expensive
– Incomplete
– Short-lived, not adapted to the Web's constant changes
The Coming of XML
• HTML
– comes from SGML
– hypertext language
– fixed number of tags
– content and presentation are mixed
– very difficult to extract data from a page
– old standard
• XML
– also comes from SGML
– semistructured data
– number of tags is not fixed
– content and presentation are not mixed
– very easy to extract data from a page
– new standard
HTML = Hypertext Language
Information System:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
HTML:
The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>.
Text + presentation
Where is the data? Getting from the HTML back to the information system is hard.
XML = Semistructured Data
Information System:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
...
XML:
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>
Data + Structure
Semistructured: more flexible
Getting from the XML back to the information system is easy.
XML : Tree Types
• Semantics and structure are in paths
– product-table/product/reference
– product-table/product/price
Tree:
product-table
  product
    reference
    designation
    price
    description
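As a minimal sketch of the idea that semantics and structure live in paths, the Python below lists the label paths of a small catalogue shaped like the product table shown earlier. Treating the `reference` attribute as a path step follows the slide; the sample data and the helper name `label_paths` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data modeled on the product table from the earlier slide (assumed values).
CATALOG = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
  </product>
  <product reference="R2D2">
    <designation>Robot</designation>
    <price unit="Dollars">19350</price>
  </product>
</product-table>
"""


def label_paths(node, prefix=""):
    """Return the set of label paths (elements and attributes) under node."""
    here = f"{prefix}/{node.tag}" if prefix else node.tag
    paths = {here}
    # Attributes are treated as one more path step, as on the slide.
    paths.update(f"{here}/{name}" for name in node.attrib)
    for child in node:
        paths.update(label_paths(child, here))
    return paths


root = ET.fromstring(CATALOG)
paths = sorted(label_paths(root))
# Following one path gives direct access to the data of interest.
prices = {p.get("reference"): float(p.findtext("price"))
          for p in root.findall("product")}
```

Here `paths` contains both `product-table/product/reference` and `product-table/product/price`, and `prices` maps each reference to its price.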
XML
• Very active/noisy field - standards
– schema (XML schema), stylesheet (XSL), resource description (RDF...)
– WML (WAP), MathML, SMIL (multimedia), RSS (news), RDF (metadata)...
• How fast will XML conquer the web?
– so far rather slow (about 1% now of the visible web; much more in intranets)
– much faster since the arrival of Explorer 5.5
A Dynamic Warehouse for the XML Data of the Web
Xyleme
Xyleme
• Warehouse
– Xyleme stores huge quantities of data (terabytes)
– Xyleme is not a search engine (only index) or a mediator (only virtual data)
• XML
– Xyleme is focused on XML, i.e., trees
• Dynamic
– Xyleme is interested in data evolution/changes
Xyleme
• September 1999: a group of researchers from
– Inria Rocquencourt, Verso Group
– U. of Mannheim, Database Group
– U. of Orsay, IASI Group
– CNAM, Vertigo Group
• September 2000: creation of a start-up
• November 2000: about 15 people
Corporate Information Today
Web
Information System
manual searches using browsers
ad-hoc applications written by web experts and tailored for specific tasks and data, i.e., inflexible and expensive
manual updates
Corporate Information with Xyleme
[Figure: Web, Xyleme-warehouse (Repository, Query Engine), and Information System; arrow labels: crawling & interpreting data, publishing, updates, queries, searches]
Five Challenges
1. Data Acquisition and Maintenance
discover data of interest and keep it up to date
2. Repository
store this data and index it so that it can be processed efficiently
3. Query Processing
efficiently support an SQL-style query language
Five Challenges - continued
4. Semantic Integration
understand DTDs and tags, partition the Web into semantic domains, provide a simple view of each domain
5. Change Control
monitor the web and offer services such as Query Subscription
Challenges - continued
• Scale to the web
• Size of data: millions/billions of pages
• Size of index: terabytes
• Number of customers
– thousands of simultaneous queries
– millions of subscriptions
Functional Architecture
[Figure: components User Interface, Xyleme Interface, Query Processor, Semantic Module, Change Control, Repository and Index Manager, Loader, Acquisition & Crawler, and Web Interface, with the Web Interface facing the Internet]
Architecture
• Cluster of PCs
• Developed with Linux and C++
• Communications
– local: CORBA
– external: HTTP
• Distribution between autonomous machines
Architecture
[Figure: machines connected by Ethernet and to the Internet: Acquisition and Maintenance machines, Loader | Query machines, Change Control and Semantic Integration machines, Index machines, Repository machines]
1. Data Acquisition and Maintenance
Goals
• Discover XML pages on the web that are of interest for customers
– For this, crawl the web (HTML+XML)
• Keep them up to date
• Do this under bounded resources
Life Cycle of a page in Xyleme
• The URL of a document D is discovered as a link in another page (or published by a customer)
• The page scheduler decides to read D
– The metadata of D is read
• type, last_date_update...
– The document D is loaded
• The document D is (re)read regularly
Main Issues
• Loading of pages
– we can load up to 5 million pages/day on a standard PC
– main cost is Internet connection
• Metadata management
• Page scheduling
– decide which page to read or refresh next
Metadata Management
• Example: management of the link matrix
– page i points to page j
– for 1 billion URLs, about 30 children/URL
– matrix has $30 \cdot 10^{9}$ edges (very sparse)
• For each page that is read,
– find the IDs of the 30 children
– 50 pages/second ⇒ 1500 database calls/second
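The figures on this slide follow from simple arithmetic, spelled out below as a small Python sketch. The constants come from the slide; the variable names and the density computation are added only for illustration.

```python
URLS = 1_000_000_000      # 1 billion URLs (from the slide)
CHILDREN_PER_URL = 30     # about 30 children per URL (from the slide)
PAGES_PER_SECOND = 50     # read rate used on the slide

# Non-zero entries of the link matrix: one per (page, child) pair.
edges = URLS * CHILDREN_PER_URL
# Fraction of the URLS x URLS matrix that is non-zero, hence "very sparse".
density = edges / (URLS * URLS)
# One ID lookup per child of every page that is read.
lookups_per_second = PAGES_PER_SECOND * CHILDREN_PER_URL
# The same read rate sustained over a full day.
pages_per_day = PAGES_PER_SECOND * 24 * 60 * 60
```

At 50 pages/second the daily total is 4.32 million pages, which is in line with the "up to 5 million pages/day" loading figure quoted earlier.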
Page Scheduling
• Decide which page to read next
– discovery (first read) and refresh (read again)
Based on:
• Importance of the page
– read important pages often
– also used to order query results
• Change rate of the page
– don't read a page that is probably up-to-date
Page Scheduling for Refresh
• Determine the refresh frequency $f_i$ for each page $i$ to minimize a cost function
• Minimize $\sum_{i=1}^{N} \mathrm{cost}_i(f_i)$ under the constraint $\sum_{i=1}^{N} f_i \le G$
where $\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page
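To make the optimization concrete, here is a minimal sketch under a strong assumption that is not in the slides: each penalty has the form cost_i(f_i) = w_i / f_i, where w_i is a hypothetical importance weight, and the whole budget G is spent. Under that assumption a Lagrange-multiplier argument gives f_i proportional to sqrt(w_i). The actual Xyleme cost function is not specified here, so this only illustrates the shape of the problem.

```python
import math


def refresh_frequencies(weights, budget):
    """Minimize sum(w_i / f_i) subject to sum(f_i) == budget.

    Assumed cost model: cost_i(f_i) = w_i / f_i. Setting the derivative of
    the Lagrangian to zero gives f_i proportional to sqrt(w_i).
    """
    roots = [math.sqrt(w) for w in weights]
    total = sum(roots)
    return [budget * r / total for r in roots]


def total_cost(weights, freqs):
    """Sum of the assumed per-page penalties w_i / f_i."""
    return sum(w / f for w, f in zip(weights, freqs))


weights = [9.0, 4.0, 1.0]   # hypothetical importance weights
budget = 6.0                # hypothetical value of G (refreshes per time unit)
freqs = refresh_frequencies(weights, budget)
uniform = [budget / len(weights)] * len(weights)
```

With these numbers the important page is refreshed three times as often as the least important one, and the total penalty (6.0) is lower than with a uniform schedule (7.0).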
Cost Function
$\mathrm{cost}_i(f_i)$, the penalty for page $i$, depends on the estimated importance and staleness of the page
• Importance of the page
– link structure
– pub/sub
• Staleness of the data
– penalty for being out of date
– penalty for aging
Evaluation of Change Rate
• Based on the Last Date of Change
– provided by the HTTP header of the page
– in general reliable but …
• Based on the number M of changes detected the last N times the page was refreshed
– limits: the actual number of changes is not known
The first method is more precise
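Both sources of information can be sketched in a few lines of Python. The function names, the fixed refresh interval, and the sample values are assumptions for illustration; the M/N estimate is only a lower bound, since several changes between two refreshes are observed as one.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def age_in_days(last_modified_header, fetched_at):
    """Days between the Last-Modified date of a page and the time it was fetched."""
    changed_at = parsedate_to_datetime(last_modified_header)
    return (fetched_at - changed_at).total_seconds() / 86400.0


def observed_change_rate(changed_flags, interval_days):
    """M changes detected over N refreshes spaced interval_days apart.

    Returns changes per day. This underestimates the true rate when a page
    changes more than once between two refreshes.
    """
    n = len(changed_flags)
    m = sum(changed_flags)
    return m / (n * interval_days)


age = age_in_days("Mon, 01 Jan 2001 00:00:00 GMT",
                  datetime(2001, 1, 3, tzinfo=timezone.utc))
rate = observed_change_rate([True, False, True, False], interval_days=7)
```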
Page Importance: Link Structure
• Intuition: a page is important if many important pages reference it: a fixpoint
• Link Matrix
– $M(i,j) = 1$ if page $i$ refers to page $j$ (0 otherwise)
– $M$ is a $10^{9} \times 10^{9}$ matrix
– $\mathrm{out}(i)$: the outdegree of page $i$
• Fixpoint
– $W_0(k) = 1/N$ (initialization)
– $W_m(k) = \sum_i \left[ M(i,k) \cdot W_{m-1}(i) / \mathrm{out}(i) \right]$
Page Importance : Algorithm
M(i,-) is stored as a list
Computation of $W_m$ (line by line):
for i = 1 to N do
[ read M(i,-) ; process the line: for each k in M(i,-), $W_m(k)$ += $W_{m-1}(i)/\mathrm{out}(i)$ ]
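The line-by-line computation can be sketched in Python as follows, with each row M(i,-) stored as a list of successor pages. The handling of pages without out-links and the three-page sample graph are assumptions for illustration; the slides do not say how Xyleme treats those cases.

```python
def page_importance(links, iterations=10):
    """links[i] is the list of pages that page i points to, i.e. row M(i,-)."""
    n = len(links)
    w = [1.0 / n] * n                       # W_0(k) = 1/N
    for _ in range(iterations):
        nxt = [0.0] * n
        for i, row in enumerate(links):     # read M(i,-) ; process the line
            if not row:
                continue                    # assumption: no out-links, nothing to pass on
            share = w[i] / len(row)         # W_{m-1}(i) / out(i)
            for k in row:
                nxt[k] += share             # W_m(k) += W_{m-1}(i) / out(i)
        w = nxt
    return w


# Hypothetical graph: page 0 -> 1 and 2, page 1 -> 2, page 2 -> 0.
links = [[1, 2], [2], [0]]
weights = page_importance(links, iterations=50)
```

On this small graph the weights settle at 0.4, 0.2, and 0.4. Each pass reads every row exactly once, which is what makes a sequential scan of a disk-resident matrix workable.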
Page Importance: Fixpoint
• Techniques for fixpoint convergence
• Some results
– convergence is fast (OK after 10 iterations)
– single precision suffices
– possible on a standard PC
• Distribution and incremental evaluation
Page Importance: Refresh
Standard importance for HTML/XML pages
HTML pages are useful only to discover XML
Taking pub/sub into account
circle = HTML, square = XML, triangle = pub/sub
2. XML Repository
Storing XML documents
• Relational store (e.g., Oracle 8i)
– binary large objects: elements cannot be accessed directly
– strongly typed data and tables: efficient
– otherwise: too many joins and inefficient
• Object database store (ODMG)
– better adapted
• Native XML storage: Natix
Natix Repository
• Goal
– minimize I/O for direct access and scanning
– efficient direct accesses using indexing
– good compaction but not at the cost of access
• Efficient storage of trees
– use fixed length storage pages
– variable length records inside a page
• Main issue: tree balancing
Tree Balancing
[Figure: a tree stored across Record 1, Record 2, and Record 3]
Tree Balancing - continued
Large collections may use several records
3. Semantic Data Integration
Web Heterogeneity
• Semantic domains, e.g., cinema
• Many possible types for data in this domain, many DTDs
• Semantic Integration
– one abstract DTD for the domain
– gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD
Cluster DTDs and Documents
The relationship is not visible unless one knows the relationship between story and tale.
Discover the Domains
Cluster DTDs sharing similar "tags" using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.) to obtain domains
[Figure: many concrete DTDs (cdtd1 to cdtd10) grouped into fewer abstract DTDs (adtd1, adtd2, adtd4)]
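The sketch below illustrates grouping DTDs that share similar tags. It swaps in a much simpler technique than the one named on the slide: a greedy pass using Jaccard similarity on tag sets, rather than frequent item set mining with linguistic tools. The tag sets, the threshold, and the function names are all hypothetical.

```python
def jaccard(a, b):
    """Similarity of two tag sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)


def cluster_dtds(dtd_tags, threshold=0.4):
    """Greedy clustering: put each DTD in the first cluster it resembles."""
    clusters = []  # each entry: (set of all tags seen in the cluster, member names)
    for name, tags in dtd_tags.items():
        for seen, members in clusters:
            if jaccard(tags, seen) >= threshold:
                seen |= tags            # grow the cluster's tag vocabulary in place
                members.append(name)
                break
        else:
            clusters.append((set(tags), [name]))
    return [members for _, members in clusters]


# Hypothetical concrete DTDs, reduced to their tag sets.
dtd_tags = {
    "cdtd1": {"movie", "title", "director", "actor"},
    "cdtd2": {"movie", "title", "actor", "year"},
    "cdtd3": {"book", "title", "author", "publisher"},
    "cdtd4": {"book", "author", "isbn", "publisher"},
}
clusters = cluster_dtds(dtd_tags)
```

With this data the two movie DTDs and the two book DTDs end up in separate clusters, each a candidate domain. A thesaurus would additionally let tags such as story and tale count as shared.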
WordNet: Useful Relationships
• Synonyms: one concept, two terms
• Hypernyms / Hyponyms: two concepts linked through generalization/specialization, e.g., vehicle & car
• Meronyms / Holonyms: two concepts linked through composition/inclusion, e.g., country & city
Choose an Abstract DTD / Domain
• Automatically
– The analysis of a cluster leads to "clusters of tags"
– Use a thesaurus (e.g., WordNet) to build a hierarchy from the clusters of tags
• Manually
– Performed by a domain expert
• Hybrid
Mapping Concrete to Abstract
• For each concrete DTD in a domain, find how it relates to the abstract DTD:
– Associate concrete tags to abstract tags using linguistic tools
– Provide relationships between paths in the concrete and abstract DTD
E.g.: cdtd3/œuvre/nom/prénom and adtd2/book/author/name/firstname
• Possibly automatic, manual or hybrid
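The first step, associating concrete tags with abstract tags, might look like the sketch below. The `SYNONYMS` table stands in for a real linguistic tool such as a thesaurus; its entries, the function name, and the sample tags are assumptions for illustration. Tags with no match are left for manual or hybrid treatment.

```python
# Hypothetical thesaurus entries: concrete term -> abstract term.
SYNONYMS = {"cost": "price", "tale": "story"}


def associate_tags(concrete_tags, abstract_tags, synonyms):
    """Map each concrete tag to an abstract tag, or to None when nothing matches."""
    mapping = {}
    for tag in concrete_tags:
        key = tag.lower()
        candidate = synonyms.get(key, key)   # identical tags match themselves
        mapping[tag] = candidate if candidate in abstract_tags else None
    return mapping


mapping = associate_tags(["product", "cost", "colour"],
                         {"product", "price"},
                         SYNONYMS)
```

Here `product` maps to itself, `cost` maps to `price` through the synonym table, and `colour` stays unmapped.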
4. Query Processing
Xyleme Query Language
• Today: a mix of OQL and XQL
• Tomorrow: the future W3C standard
• Example
select product/name, product/price
from doc in catalogue,
product in doc/product
where product//components contains "flash"
and product/description contains "camera"
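To show what the example query asks for, the sketch below evaluates it by brute force over an in-memory document with Python's `xml.etree.ElementTree`. This is not how Xyleme executes queries (the later slides describe index-based evaluation); the sample catalogue and helper names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue holding one document.
CATALOGUE = [ET.fromstring("""
<catalogue>
  <product>
    <name>X23</name>
    <price>359.99</price>
    <description>a compact camera</description>
    <details><components>flash, lens</components></details>
  </product>
  <product>
    <name>Z25</name>
    <price>1299.99</price>
    <description>a desktop PC</description>
    <details><components>keyboard, mouse</components></details>
  </product>
</catalogue>
""")]


def text_of(elem):
    """All text found in an element and its descendants."""
    return "".join(elem.itertext())


def run_query(catalogue):
    results = []
    for doc in catalogue:                        # from doc in catalogue,
        for product in doc.findall("product"):   # product in doc/product
            # product//components: components elements at any depth
            has_flash = any("flash" in text_of(c)
                            for c in product.iter("components"))
            # product/description: direct children only
            has_camera = any("camera" in text_of(d)
                             for d in product.findall("description"))
            if has_flash and has_camera:
                results.append((product.findtext("name"),
                                product.findtext("price")))
    return results


results = run_query(CATALOGUE)
```

Only the X23 product satisfies both conditions, so `results` holds its name and price.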
Data Distribution
• Cluster of documents = physical collection of documents ( semantic domain)
Distribution
• Storage machine
– in charge of a cluster of documents
• Index machine
– index for a cluster
Step 0: Indexing
• Standard inverted index
– word → documents that contain this word
• Xyleme index
– word → elements that contain this word (document + element identifier)
• Goal: more work can be performed without accessing data
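A minimal sketch of an element-level inverted index follows. It assumes the element identifier is the preorder rank of the element in its document and indexes only the text directly inside each element; the sample document and function name are made up for illustration.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def build_element_index(docs):
    """Map each word to the set of (document id, element id) pairs containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # root.iter() walks the tree in document (preorder) order.
        for elem_id, elem in enumerate(root.iter(), start=1):
            for word in (elem.text or "").lower().split():
                index[word].add((doc_id, elem_id))
    return index


docs = {
    "d1": "<product><name>camera</name>"
          "<description>camera with flash</description></product>",
}
index = build_element_index(docs)
```

Because the postings name elements rather than whole documents, a condition such as "description contains camera" can be narrowed down from the index alone, before any document is fetched.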
Step 1: Localization
• Query on an abstract dtd
• Localization of machines that host concrete DTDs that will participate in the query
[Figure: a global query on the abstract DTD becomes a union of queries on local machines (local queries); e.g., catalogue/product/price is relevant for machine 56 and machine 45]
Step 2: Optimization
• Algebraic rewriting
• Linear search strategy based on simple heuristics
– use in-memory indexes
– minimize communication
• Optimization of the global plan
• Optimization of the local plans
Step 3: Execution
A plan usually consists of:
1. parallel translation from abstract queries to concrete patterns on the relevant index machines
2. parallel index scans to identify the relevant elements for a concrete pattern
3. parallel construction of resulting elements
4. pipeline evaluation (i.e., no intermediate data structure)
Note: step 2 requires smart indexes
Execution: Abstract2Concrete
for catalogue/product/price, scan the relevant concrete patterns:
d1//camera/price, d2/product/cost, d3/piano/price, ...
for each concrete pattern, scan the element ids:
&234, &177
For each concrete pattern, the local plan is optimized dynamically
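The translation from an abstract path to local queries can be pictured as a table lookup followed by grouping per machine, as in the sketch below. The concrete patterns and the machine numbers 56 and 45 appear on the slides, but which pattern lives on which machine is an assumption, as are the table layout and the function name.

```python
from collections import defaultdict

# Abstract path -> (machine, concrete pattern) pairs.
# The assignment of patterns to machines is hypothetical.
CONCRETE_PATTERNS = {
    "catalogue/product/price": [
        (56, "d1//camera/price"),
        (56, "d2/product/cost"),
        (45, "d3/piano/price"),
    ],
}


def localize(abstract_path, table):
    """Split a global query into per-machine lists of concrete patterns."""
    plan = defaultdict(list)
    for machine, pattern in table.get(abstract_path, []):
        plan[machine].append(pattern)
    return dict(plan)


local_queries = localize("catalogue/product/price", CONCRETE_PATTERNS)
```

The global result is then the union of what each machine returns for its own list of patterns.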
Element Identifiers
• Essential for query processing
• Identifier = (preorder rank, postorder rank)
– X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y)
– E.g., 2 < 5 and 4 > 2 => (2,4) is an ancestor of (5,2)
[Figure: a tree with nodes A to G, each labeled with its preorder rank and postorder rank (1 to 7)]
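The identifier scheme can be tried out with the short Python sketch below. The shape of the tree is an assumption (the original figure is not recoverable from the transcript); it was chosen so that nodes B and E receive the identifiers (2,4) and (5,2) used in the example above.

```python
def number_tree(tree):
    """tree is (label, [children]). Returns {label: (preorder rank, postorder rank)}."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        label, children = node
        counters["pre"] += 1            # rank assigned when the node is entered
        pre = counters["pre"]
        for child in children:
            visit(child)
        counters["post"] += 1           # rank assigned when the node is left
        ranks[label] = (pre, counters["post"])

    visit(tree)
    return ranks


def is_ancestor(x, y):
    """X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y)."""
    return x[0] < y[0] and x[1] > y[1]


# Hypothetical tree: A has children B, F, G; B has child C; C has children D, E.
TREE = ("A", [("B", [("C", [("D", []), ("E", [])])]), ("F", []), ("G", [])])
ranks = number_tree(TREE)
```

The ancestor test needs only the two identifiers, so it can be answered from index entries without touching the stored documents.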
Patterns and Indexes
[Figure: a pattern with nodes product, name, description, and the word "camera", shown with index entries (d1, 12, 200), (d1, 201, 400); (d1, 1, 11), (d1, 205, 224), (d1, 228, 237); (d1, 229), (d2, 14)]
Heuristics: to perform joins, start with the smallest cardinality (to minimize the size of intermediate results)
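The sketch below joins index entries taken from the slide, under two assumptions that the transcript does not confirm: each triple is read as (document, start, end) with containment meaning interval inclusion, and the three-entry list is attributed to the description element. The function names are made up for illustration.

```python
def contains(outer, inner):
    """outer is (doc, start, end); inner is (doc, start, end) or (doc, position)."""
    doc, start, end = outer
    inner_start, inner_end = inner[1], inner[-1]
    return doc == inner[0] and start <= inner_start and inner_end <= end


def containment_join(outers, inners):
    """Keep the outer entries that contain at least one inner entry."""
    return [o for o in outers if any(contains(o, i) for i in inners)]


# Index entries from the slide; their assignment to labels is an assumption.
postings = {
    "product": [("d1", 12, 200), ("d1", 201, 400)],
    "description": [("d1", 1, 11), ("d1", 205, 224), ("d1", 228, 237)],
    "camera": [("d1", 229), ("d2", 14)],
}

# Start from the short word list so that intermediate results stay small.
descriptions = containment_join(postings["description"], postings["camera"])
products = containment_join(postings["product"], descriptions)
```

Starting from the word "camera" leaves a single description entry and then a single product entry, so every intermediate result stays at one element.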
5. Change Control
The Web changes all the time
• Data acquisition + maintenance
– keep the warehouse up-to-date
• Version management
– representation and storage of changes
• Change monitoring
– query subscription
Versions
• Version some documents or some sites
• Version some continuous queries
continuous query: a query that is evaluated regularly, e.g., get each Monday the list of movies showing in Paris
Representing Versions: Deltas
• Version storage
– current document
– persistent identifiers for elements
– description of changes: completed deltas
• Deltas are XML documents
• Changes can be processed like other data
– exchanged: send me changes since June 1st!
– queried: what are the products inserted since 2/1/99?
Completed Delta
<delta>
  <unit_delta previousversion="1" thistime="2">
    <delete xid="11" xid-children="(17-21)">
      <Product>
        <Name>DVD</Name>
        <Price>500</Price>
      </Product>
    </delete>
    <move xid="16" new_parent="11" new_position="2" old_parent="11" old_position="1" />
    <update xid="3" new_value="50" old_value="100" />
    <update xid="8" new_value="100" old_value="150" />
  </unit_delta>
  <unit_delta>...</unit_delta>
</delta>
xid = persistent identifier
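Because each update records both the old and the new value, a delta of this kind can in principle be replayed in either direction. The sketch below does this for `update` operations only, on a flat map from persistent identifier to value. That data model, the function name, and the consistency check are assumptions for illustration, not the Xyleme implementation.

```python
import xml.etree.ElementTree as ET

# The two update operations from the slide, in a well-formed unit delta.
UNIT_DELTA = """
<unit_delta previousversion="1" thistime="2">
  <update xid="3" new_value="50" old_value="100" />
  <update xid="8" new_value="100" old_value="150" />
</unit_delta>
"""


def apply_updates(values, unit_delta_xml, backward=False):
    """Apply the update operations of one unit delta to {xid: value}.

    backward=True undoes the delta, which works because the old values
    are stored alongside the new ones.
    """
    result = dict(values)
    source, target = (("new_value", "old_value") if backward
                      else ("old_value", "new_value"))
    for op in ET.fromstring(unit_delta_xml).iter("update"):
        xid = op.get("xid")
        if result.get(xid) != op.get(source):
            raise ValueError(f"element {xid} does not match the delta")
        result[xid] = op.get(target)
    return result


version1 = {"3": "100", "8": "150"}
version2 = apply_updates(version1, UNIT_DELTA)
restored = apply_updates(version2, UNIT_DELTA, backward=True)
```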
Query Subscription
• Users may subscribe to certain events, e.g.,
  – changes in a page or a set of pages
  – changes in pages from a particular semantic domain, containing some specific words, or with a particular DTD
  – changes of particular elements somewhere (new products in a catalog)
• Users may request to be notified
  – immediately at the time the event is detected
  – regularly, e.g., weekly
  – after a certain number of event detections
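The three notification options can be sketched as a small buffering policy. This is an illustrative Python sketch, not Xyleme's implementation: the class name, the mode names ("immediate", "periodic", "count"), and the `on_tick` clock hook are all assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    """Buffers detected events and decides when to notify the user.

    mode is one of "immediate", "periodic", or "count" (hypothetical names).
    """
    mode: str
    threshold: int = 1  # used only when mode == "count"
    pending: list = field(default_factory=list)

    def on_event(self, event):
        """Record an event; return a report if one is due now, else None."""
        self.pending.append(event)
        if self.mode == "immediate":
            return self._flush()
        if self.mode == "count" and len(self.pending) >= self.threshold:
            return self._flush()
        return None

    def on_tick(self):
        """Called by a clock (e.g., weekly); periodic subscriptions report."""
        if self.mode == "periodic" and self.pending:
            return self._flush()
        return None

    def _flush(self):
        """Return all buffered events and empty the buffer."""
        report, self.pending = self.pending, []
        return report


if __name__ == "__main__":
    weekly = Subscription("periodic")
    weekly.on_event("new movie page")
    print(weekly.on_tick())
```

Keeping the buffer inside the subscription means the detector only has to call `on_event`, while a separate clock drives `on_tick`.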
Example
subscription myPariscope
% what are the new movie entries in Pariscope site
monitoring newMovies
select URL
where URL extends www.pariscope.fr/movies/*
and new(self)
% manage the changes in the movies showing in Paris
continuous delta Showing
select ... from ... where
when daily
notify daily % send me a daily report
Step 1: Atomic Event Detection
[Diagram labels: HTML parser, XML loader, metadata manager, complex event detection. During loading, a document d picks up alerts (d, then d/46, then d/46,67) and is passed on as document & alerts.]
atomic event 46: URL matches pattern www.inria.fr/*
atomic event 67: XML document contains the tag painter
Loading rate: 5 million pages/day
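Atomic event detection can be sketched as a set of predicates evaluated on each loaded document. This Python sketch reuses the two event ids from the slide. The document representation (a dict with "url" and "tags") and the predicate encoding are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Atomic events keyed by the ids used on the slide. Encoding each event as a
# predicate over a dict with "url" and "tags" is an assumption of this sketch.
ATOMIC_EVENTS = {
    46: lambda doc: fnmatchcase(doc["url"], "www.inria.fr/*"),
    67: lambda doc: "painter" in doc["tags"],
}


def detect_atomic_events(doc):
    """Return the sorted ids of all atomic events raised by one document."""
    return sorted(eid for eid, test in ATOMIC_EVENTS.items() if test(doc))


if __name__ == "__main__":
    d = {"url": "www.inria.fr/verso/a.xml", "tags": {"painter", "name"}}
    print(detect_atomic_events(d))
```

A linear scan over every predicate is only for readability. At the loading rate quoted on the slide, a real system would presumably index URL patterns and tags rather than test each event in turn.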
Step 2: Complex Event Detection
[Diagram labels: HTML parser, XML loader, complex event detection]
complex event 12: 67 & 46 (XML document contains the tag painter and URL matches pattern www.inria.fr/*)
Millions of page alerts per day
Millions of subscriptions
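A complex event such as 12 is a conjunction of atomic events, so detection amounts to finding every conjunction fully covered by a document's alerts. The Python sketch below uses an inverted index with per-event counting. This is a substituted, generic technique for illustration and not necessarily the algorithm used in Xyleme. Event 13 and all function names are hypothetical.

```python
from collections import defaultdict

# Complex event id -> set of atomic event ids that must all hold.
# Event 12 is the slide's example; event 13 is hypothetical.
COMPLEX_EVENTS = {
    12: {46, 67},
    13: {46},
}


def build_index(complex_events):
    """Map each atomic event id to the complex events that mention it."""
    index = defaultdict(list)
    for cid, atoms in complex_events.items():
        for atom in atoms:
            index[atom].append(cid)
    return index


def detect_complex_events(atomic_ids, complex_events, index):
    """Return ids of complex events whose atomic events are all present."""
    hits = defaultdict(int)
    for atom in set(atomic_ids):
        for cid in index.get(atom, []):
            hits[cid] += 1
    # A complex event fires when every one of its atomic events was seen.
    return sorted(cid for cid, n in hits.items()
                  if n == len(complex_events[cid]))


if __name__ == "__main__":
    idx = build_index(COMPLEX_EVENTS)
    print(detect_complex_events([46, 67], COMPLEX_EVENTS, idx))
```

The index means that only complex events sharing at least one atomic event with the document are examined, which matters when there are millions of subscriptions.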
Step 3: Notification Processor
[Diagram labels: notification processor, continuous queries, complex event detection, clock; arrows labeled triggers, alerts, notification/monitoring, notification/results]
Millions of notifications/day
Conclusion
One Question Only
• The web is turning from a large collection of documents into a huge knowledge base
When will I be able to get the precise knowledge I need?
Database + Knowledge Base + Linguistic + ...
Thank you