1st of June, 2011
DWXML: A PRESERVATION FORMAT FOR DATA WAREHOUSES
Carlos Aldeias
Gabriel David
Cristina Ribeiro
DWXML – A Preservation Format for Data Warehouses 2/46Carlos Aldeias
1st of June, 2011
OUTLINE
Introduction
Motivation
Data Warehouse Preservation
DWXML Definition
DBPreserve Suite Application
Conclusions
INTRODUCTION
• Companies, institutions and governments rely increasingly on On-Line Analytical Processing (OLAP)
  ○ Major benefits for analysis and decision support
  ○ Selective extraction and analysis of data from different perspectives
  ○ Most systems are structured using Data Warehouses
• OLAP types:
  ○ ROLAP – Relational OLAP
  ○ MOLAP – Multidimensional OLAP
  ○ HOLAP – Hybrid OLAP
PRESERVATION CONCERN
• Data Warehouse as a digital object
  ○ Different from conventional digital objects: data warehouses are complex digital objects
  ○ They are based on a dimensional model: star schema, facts, dimensions with levels and hierarchies, bridges and datamarts
  ○ They are often implemented on relational databases (ROLAP), keeping data in tables, views and schemas
• Data vs. Metadata
  ○ The primary data stored in tables must be archived together with the metadata, at both the relational and dimensional levels
• Technologies are evolving continually
  ○ Data Warehouses created with today's technologies may not be accessible with upcoming versions
RELEVANT WORKS
InterPARES
DATABASES / DATA WAREHOUSES
IS DATA WAREHOUSE JUST ANOTHER NAME FOR DATABASE?
DATA WAREHOUSE: DIMENSIONAL MODEL CONCEPTS
[Diagram labels: star schema, fact table, fact, measure, bridge table, dimension, hierarchy, join key, level, level key, attribute, datamart, snowflake schema, sub-dimension]
DBPRESERVE PROJECT
DBPreserve
• Long-term preservation of Institutional Electronic Records and Databases
• Archive databases ensuring their long-term accessibility
Dimensional Model [Rahman, 2010]
• Data warehouse for archives model definition
• Migration from a relational model to a dimensional model
OAIS Model [CCSDS, 2002]
• Modeling System according to OAIS
• Long-term preservation using XML
• Independence from technology
• Portability between systems
DATA WAREHOUSE PRESERVATION
• Existing preservation approaches do not meet data warehouse preservation requirements
• For data warehouses implemented with relational database technologies, some existing efforts can be reused
• However, they still lack an important metadata layer that describes the data warehouse structure and entities
DATA WAREHOUSE METADATA
• Star schema: the fact table is surrounded by dimensional tables
• Bridge tables
• Example from a case study, implemented using Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
DW METADATA – FACT TABLES
• A fact table is the center of a star schema
• Consists of facts of a business process
• Facts
• Measures:
  ○ ADDITIVE
  ○ NON ADDITIVE
  ○ SEMI ADDITIVE
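The three measure types listed above differ in how they may be aggregated. The sketch below illustrates the usual reading of the terms with made-up bank-account rows; the table, column names and values are hypothetical and do not come from the case study.

```python
# Hypothetical fact rows: (day, account, deposit, balance, interest_rate)
rows = [
    ("2011-05-30", "A", 100.0, 100.0, 0.02),
    ("2011-05-31", "A", 50.0, 150.0, 0.02),
    ("2011-05-30", "B", 200.0, 700.0, 0.04),
    ("2011-05-31", "B", 0.0, 700.0, 0.04),
]

# ADDITIVE: deposits can be summed across every dimension (days and accounts).
total_deposits = sum(r[2] for r in rows)

# SEMI ADDITIVE: balances can be summed across accounts but not across days;
# for a period, use the balance on the last day instead of a sum over days.
last_day = max(r[0] for r in rows)
closing_balance = sum(r[3] for r in rows if r[0] == last_day)

# NON ADDITIVE: rates cannot be summed at all; they are averaged or recomputed.
mean_rate = sum(r[4] for r in rows) / len(rows)

print(total_deposits, closing_balance, round(mean_rate, 4))
```

Recording the measure type in the preserved metadata lets a future reader of the archive know which aggregations remain meaningful.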
DW METADATA – DIMENSIONS
• They give the context and meaning to the facts
• Represent the relevant vectors of analysis of the business process facts
• Usually represented by one or more dimensional tables
• Levels
• Hierarchies
• Attributes
DW METADATA: CREATE DIMENSION
• Project's case study implements the dimensional model using Oracle Database 11g

    CREATE DIMENSION class_dim
      LEVEL class IS (IPDW_CLASS.CLASS_ID)
      LEVEL course IS (IPDW_CLASS.COURSE_ID)
      HIERARCHY class_rollup (class CHILD OF course)
      ATTRIBUTE class DETERMINES (IPDW_CLASS.CODE, IPDW_CLASS.ACRONYM,
                                  IPDW_CLASS.NAME, IPDW_CLASS.TYPE)
      ATTRIBUTE course DETERMINES (IPDW_CLASS.COUR_CODE, IPDW_CLASS.COUR_ACRONYM,
                                   IPDW_CLASS.COUR_NAME, IPDW_CLASS.COUR_TYPE,
                                   IPDW_CLASS.COURSE_PREVIOUS_COD);
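To make explicit which pieces of dimensional metadata the statement above carries, here is a minimal sketch that pulls levels, hierarchies and attributes out of the DDL text with regular expressions. This is an illustration only: the presentation says the metadata is extracted from the data dictionary, not by parsing DDL. The function name `parse_dimension` is hypothetical, and the parser handles only simple statements like this one (no JOIN KEY clauses, no nested parentheses).

```python
import re

DDL = """
CREATE DIMENSION class_dim
  LEVEL class IS (IPDW_CLASS.CLASS_ID)
  LEVEL course IS (IPDW_CLASS.COURSE_ID)
  HIERARCHY class_rollup (class CHILD OF course)
  ATTRIBUTE class DETERMINES (IPDW_CLASS.CODE, IPDW_CLASS.ACRONYM,
                              IPDW_CLASS.NAME, IPDW_CLASS.TYPE);
"""

def split_columns(text):
    """Split a comma-separated column list and trim whitespace and newlines."""
    return [column.strip() for column in text.split(",")]

def parse_dimension(ddl):
    """Extract name, levels, hierarchies and attributes from a simple CREATE DIMENSION."""
    name = re.search(r"CREATE\s+DIMENSION\s+(\w+)", ddl, re.I).group(1)
    levels = {
        m.group(1): split_columns(m.group(2))
        for m in re.finditer(r"\bLEVEL\s+(\w+)\s+IS\s*\(([^)]*)\)", ddl, re.I)
    }
    # "a CHILD OF b" lists levels from the most detailed to the least detailed.
    hierarchies = {
        m.group(1): re.split(r"\s+CHILD\s+OF\s+", m.group(2).strip(), flags=re.I)
        for m in re.finditer(r"\bHIERARCHY\s+(\w+)\s*\(([^)]*)\)", ddl, re.I)
    }
    attributes = {
        m.group(1): split_columns(m.group(2))
        for m in re.finditer(r"\bATTRIBUTE\s+(\w+)\s+DETERMINES\s*\(([^)]*)\)", ddl, re.I)
    }
    return {"name": name, "levels": levels,
            "hierarchies": hierarchies, "attributes": attributes}

parsed = parse_dimension(DDL)
print(parsed["levels"])
print(parsed["hierarchies"])
```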
DW METADATA – BRIDGE TABLE
• Bridge tables are used to resolve a many-to-many relationship between a fact and a dimension
• Also used to flatten out a hierarchy in a dimension
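The many-to-many case can be sketched with an in-memory SQLite database. All table names, column names and rows below are hypothetical and only illustrate the join pattern; the case study itself runs on Oracle.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical schema: one answer may concern a group of several teachers.
    CREATE TABLE teacher (teacher_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE teacher_group_bridge (group_id INTEGER, teacher_id INTEGER);
    CREATE TABLE answers (answer_id INTEGER PRIMARY KEY, group_id INTEGER, answer INTEGER);

    INSERT INTO teacher VALUES (1, 'Ana'), (2, 'Rui');
    INSERT INTO teacher_group_bridge VALUES (10, 1), (10, 2), (11, 2);
    INSERT INTO answers VALUES (100, 10, 4), (101, 11, 5);
""")

# The fact table joins the bridge on group_id; the bridge fans out to the dimension.
rows = con.execute("""
    SELECT t.name, SUM(a.answer)
    FROM answers a
    JOIN teacher_group_bridge b ON b.group_id = a.group_id
    JOIN teacher t ON t.teacher_id = b.teacher_id
    GROUP BY t.name
    ORDER BY t.name
""").fetchall()

print(rows)
```

Answer 100 is counted for both teachers of group 10, which is why a preserved description needs to say which table plays the bridge role: without that, the join path between fact and dimension is not recoverable from the relational metadata alone.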
DW METADATA – SNOWFLAKE
• A snowflake schema is similar to a star schema, but one or more dimension tables are partially normalized
• Sub-dimensions
DW METADATA – DATAMART
Subset of a data warehouse
Typically, a set of star and snowflake schemas
DATA WAREHOUSE PRESERVATION FORMAT PROPOSAL
• Analysis of relational database preservation formats
  ○ DBML (Database Markup Language) [Ramalho, 2007]
  ○ SIARD Format (Software Independent Archiving of Relational Databases) [SFA, 2008]
• Analysis of Data Warehouse XML representations
  ○ XCube (for multidimensional schemas) [Hummer, 2003]
DATA WAREHOUSE PRESERVATION FORMAT PROPOSAL
• Decision on extending the SIARD Format
  ○ Separates metadata from primary data
  ○ Segmented representation of primary data
  ○ Ready-to-use application that creates a SIARD format from a relational database (MSAccess, MSSQL and Oracle)
• Add a metadata layer regarding the dimensional model perspective
  ○ Extracting data warehouse metadata from the data dictionary
  ○ Defining an XML structure for the dimensional model
  ○ Embedding it into the SIARD format
RELATIONAL DATABASE PRESERVATION WITH SIARD
• Header folder for metadata
• Content folder for primary data
  ○ Organized in directories
  ○ Single XML file for each data table
• SIARD Suite – set of tools for migrating, editing and reactivating databases
EXTENDING SIARD FORMAT
• Add an XML file with the extra metadata layer for data warehouse characterization
• Add the corresponding schema
• No action on the primary data
• Data in the DW is ingested into the SIARD Suite as a relational database
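The extension steps above can be sketched as follows, assuming the SIARD archive is handled as a ZIP container with a header folder and a content folder. The function name `embed_dwxml`, the entry name `header/dw.xml` and the demo archive contents are hypothetical; `dw.xsd` is the schema file name that appears in the DWXML sample. This is not the DBPreserve Suite implementation.

```python
import io
import zipfile

def embed_dwxml(siard_bytes, dwxml_bytes, xsd_bytes):
    """Return a copy of the archive with the DWXML file and its schema added."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(siard_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for item in src.infolist():
            # Existing entries, including all primary data, are copied unchanged.
            dst.writestr(item, src.read(item.filename))
        dst.writestr("header/dw.xml", dwxml_bytes)  # hypothetical entry name
        dst.writestr("header/dw.xsd", xsd_bytes)    # schema for the extra layer
    return out.getvalue()

# Tiny stand-in archive with an illustrative layout, for demonstration only.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("header/metadata.xml", "<siardArchive/>")
    z.writestr("content/schema0/table0/table0.xml", "<table/>")

archive = embed_dwxml(demo.getvalue(), b"<dwxml version='1.0'/>", b"<schema/>")
names = zipfile.ZipFile(io.BytesIO(archive)).namelist()
print(names)
```

Because nothing under the content folder is rewritten, tools that understand only the relational level can keep reading the archive and simply ignore the two extra header entries.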
DWXML DEFINITION
Identifier   Opt  Description
version           DWXML format version
stars             List of stars in the data warehouse
dimensions        List of dimensions, dimensional tables and views in the data warehouse
schemas           List of schemas in the data warehouse
datamarts         List of datamarts in the data warehouse
dwBinding         Additional metadata for data warehouse connection description and DWXML file generation
DWXML – STARS
Identifier   Opt  Description
name              The name of the star
description       The meaning and content of the star
factTable         Fact table of the star
ray               Ray of the star connecting the fact table with a bridge table and/or a dimension (referencing by schema and name)
table             List of extra tables to accommodate unexpected special cases
view              List of extra views to accommodate unexpected special cases
DWXML – FACTTABLE
Identifier   Opt  Description
schema            The name of the schema of the fact table
name              The name of the fact table
joinColumns       List of columns used in the join between a fact table and a bridge table
facts             List of facts in the fact table
type              The type of the fact table (CUMULATIVE or SNAPSHOT)
grain             The grain of the fact, the meaning and content of a row in the fact table
DWXML – DIMENSION
Identifier   Opt  Description
schema            The name of the schema of the dimension
name              The name of the dimension
description       The meaning and content of the dimension
levels            List of levels in the dimension
hierarchies       List of hierarchies in the dimension
attributes        List of attributes in the dimension
DWXML – LEVEL AND ATTRIBUTE
Level
Identifier     Opt  Description
name                The name of the level
description         The meaning and content of the level
levelKey            The column or list of columns as key of the level
Attribute
Identifier     Opt  Description
attributeName       The name of the attribute
level               The level or list of levels referenced by the attribute
name                The name of the level
determines          The column or list of columns that defines the level
DWXML – HIERARCHY
Identifier   Opt  Description
name              The name of the hierarchy
description       The meaning and content of the hierarchy
levels            List of levels (descending order by appearance) in the hierarchy
joinKey           The column or list of columns used to join the levels, or a reference to a dimension when the levels are in different dimensions
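As an illustration of how the dimension, level and hierarchy identifiers could fit together, the sketch below builds a dimension element for the class_dim example from the CREATE DIMENSION slide. The element nesting, the `column` child element, the function name `build_dimension` and the choice of the CALDEIAS schema are assumptions; the slides list the identifiers but do not show a serialized dimension. "Descending order" is read here as coarsest level first.

```python
import xml.etree.ElementTree as ET

def build_dimension(schema, name, levels, hierarchies):
    """levels: {level name: [key columns]}; hierarchies: {name: [levels, coarsest first]}."""
    dim = ET.Element("dimension")
    ET.SubElement(dim, "schema").text = schema
    ET.SubElement(dim, "name").text = name

    levels_el = ET.SubElement(dim, "levels")
    for level_name, key_columns in levels.items():
        level = ET.SubElement(levels_el, "level")
        ET.SubElement(level, "name").text = level_name
        key = ET.SubElement(level, "levelKey")
        for column in key_columns:
            ET.SubElement(key, "column").text = column  # assumed child element

    hierarchies_el = ET.SubElement(dim, "hierarchies")
    for hierarchy_name, ordered_levels in hierarchies.items():
        hierarchy = ET.SubElement(hierarchies_el, "hierarchy")
        ET.SubElement(hierarchy, "name").text = hierarchy_name
        hierarchy_levels = ET.SubElement(hierarchy, "levels")
        for level_name in ordered_levels:
            ET.SubElement(hierarchy_levels, "level").text = level_name
    return dim

class_dim = build_dimension(
    "CALDEIAS",                 # assumption: same schema as in the DWXML sample
    "CLASS_DIM",
    {"CLASS": ["CLASS_ID"], "COURSE": ["COURSE_ID"]},
    {"CLASS_ROLLUP": ["COURSE", "CLASS"]},
)
xml_text = ET.tostring(class_dim, encoding="unicode")
print(xml_text)
```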
DWXML – DATAMART
Identifier   Opt  Description
name              The name of the datamart
description       The meaning and content of the datamart
stars             List of stars that define the datamart
DWXML – SCHEMA
Identifier   Opt  Description
name              The name of the schema
description       The meaning and content of the schema
folder            The name of the folder in the SIARD format
tables            List of tables of the schema and their definition
views             List of views of the schema and their definition
DWXML – TABLE
Identifier   Opt  Description
name              The name of the table
description       The meaning and content of the table
folder            The name of the folder in the SIARD format
nRows             Number of rows of the table
columns           List of columns of the table and their definition
primaryKey        The primary key of the table
foreignKeys       List of foreign keys of the table and their definition
role              The role of the table in the dimensional model
DWXML – VIEW
Identifier   Opt  Description
name              The name of the view
description       The meaning and content of the view
columns           List of columns of the view
query             The query that represents the view
role              The role of the view in the dimensional model
DWXML – SAMPLE

    <?xml version="1.0" encoding="UTF-8"?>
    <dwxml version="1.0" xsi:noNamespaceSchemaLocation="dw.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <stars>
        <star>
          <name>IPDW_ANSWERS_STAR</name>
          <description>Star related to the answers</description>
          <factTable>
            <schema>CALDEIAS</schema>
            <name>IPDW_ANSWERS</name>
            <facts>
              <fact>
                <name>ANSWER</name>
                <column>ANSWER</column>
                <measure>ADDITIVE</measure>
              </fact>
            </facts>
          </factTable>
          <ray>
            <dimension>
              <schema>CALDEIAS</schema>
              <name>IPDW_QUESTION</name>
            </dimension>
          </ray>
          <ray> ... </ray>
        </star>
      </stars>
      ...
    </dwxml>
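A minimal sketch of how a future reader might consume such a file, using only the element names visible in the sample above. The text parsed here is a trimmed copy of the sample with the elided parts and the XML declaration left out; the function name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<dwxml version="1.0">
  <stars>
    <star>
      <name>IPDW_ANSWERS_STAR</name>
      <description>Star related to the answers</description>
      <factTable>
        <schema>CALDEIAS</schema>
        <name>IPDW_ANSWERS</name>
        <facts>
          <fact>
            <name>ANSWER</name>
            <column>ANSWER</column>
            <measure>ADDITIVE</measure>
          </fact>
        </facts>
      </factTable>
      <ray>
        <dimension>
          <schema>CALDEIAS</schema>
          <name>IPDW_QUESTION</name>
        </dimension>
      </ray>
    </star>
  </stars>
</dwxml>"""

def summarize(dwxml_text):
    """List each star with its fact table, measures and connected dimensions."""
    root = ET.fromstring(dwxml_text)
    summary = []
    for star in root.iterfind("stars/star"):
        fact_table = star.find("factTable")
        summary.append({
            "star": star.findtext("name"),
            "fact_table": f'{fact_table.findtext("schema")}.{fact_table.findtext("name")}',
            "measures": {fact.findtext("name"): fact.findtext("measure")
                         for fact in fact_table.iterfind("facts/fact")},
            "dimensions": [f'{d.findtext("schema")}.{d.findtext("name")}'
                           for d in star.iterfind("ray/dimension")],
        })
    return summary

stars = summarize(SAMPLE)
print(stars)
```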
DBPRESERVE SUITE APPLICATION FEATURES
• Integrates the SiardFromDb application to build the SIARD format of the data warehouse
• Extracts metadata for characterization of the dimensional model
  ○ Schemas, dimensions, hierarchies, levels, attributes, tables, table comments, primary and foreign keys, views
• Sorts the tables according to their role in the data warehouse
• Proposes a DWXML description based on the extracted metadata
• DWXML editing using a GUI
• Graphical representation of star schemas and dimensions and their relationships
• Creates, views and embeds the DWXML file into the SIARD format
• Accesses and retrieves the primary data
DBPRESERVE SUITE ARCHITECTURE
[Architecture diagram: Connection Module, Metadata Module, SIARD Module, DWXML Module and Output Module on Netbeans Platform 7 RC1 | JDK 7, with SIARDfromDB, JDOM, OJDBC, …]
DBPRESERVE SUITE CONNECTION TO THE DW
DBPRESERVE SUITE SIARDFROMDB INTEGRATION
DBPRESERVE SUITE METADATA EXTRACTION
DBPRESERVE SUITE DWXML PROPOSAL
DBPRESERVE SUITE SCHEMAS VIEWER: STARS
DBPRESERVE SUITE SCHEMAS VIEWER: DIMENSIONS
DBPRESERVE SUITE DWXML EDITING
DBPRESERVE SUITE DWXML EMBEDDING AND VIEWING
DBPRESERVE SUITE PRIMARY DATA RETRIEVAL FROM XML
DBPRESERVE SUITE CASE STUDY RESULTS
• Data Warehouse
  ○ 17 tables (one with more than 2M records)
  ○ Data size: 115 MB
• SIARD Format
  ○ 17 XML files with primary data (one with 323 MB)
  ○ SIARD metadata size: 71 KB
  ○ DWXML metadata size: 86 KB
  ○ Total size: 360 MB
• Extraction times:
  ○ SIARD data: 2h30m
  ○ SIARD metadata: 4 min
  ○ DWXML metadata: 3 sec
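A back-of-the-envelope reading of the figures above, assuming the 115 MB and 360 MB values are directly comparable and that 1 MB = 1024 KB. The ratios are derived here and are not stated in the presentation.

```python
# Figures reported for the case study.
dw_size_mb = 115
siard_total_mb = 360
siard_metadata_kb = 71
dwxml_metadata_kb = 86

# Size of the XML archive relative to the source data warehouse.
expansion = siard_total_mb / dw_size_mb

# Share of the archive taken by the two metadata layers together.
metadata_kb = siard_metadata_kb + dwxml_metadata_kb
metadata_share = metadata_kb / (siard_total_mb * 1024)

# Extraction times in seconds: 2h30m, 4 min and 3 sec.
data_seconds = 2 * 3600 + 30 * 60
siard_metadata_seconds = 4 * 60
dwxml_seconds = 3
total_seconds = data_seconds + siard_metadata_seconds + dwxml_seconds
dwxml_time_share = dwxml_seconds / total_seconds

print(f"archive is about {expansion:.2f}x the source size")
print(f"metadata is about {metadata_share:.3%} of the archive")
print(f"DWXML extraction is about {dwxml_time_share:.3%} of the total time")
```

Under these assumptions the archive is roughly three times the size of the source data, while the added DWXML layer is negligible in both size and extraction time.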
CONCLUSIONS
• Definition of DWXML, a representation of the dimensional model of a DW
• Design and implementation of DBPreserve Suite
  ○ Extraction of the metadata that describes the dimensional model
  ○ Manual adjustments of the dimensional model
  ○ Generation of the XML file and embedding into the SIARD format file
  ○ Primary data browsing
• The result is compliant with the SIARD Suite tools (just the relational level)
REFERENCES
[CCSDS, 2002] Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (OAIS) - Blue Book. Washington: National Aeronautics and Space Administration, 2002.
[Ferreira, 2006] Miguel Ferreira. Introdução à Preservação Digital - Conceitos, estratégias e actuais consensos. Escola de Engenharia da Universidade do Minho, 2006.
[Hendley, 1998] Tony Hendley. Comparison of methods & costs of digital preservation. Technical report, British Library Research and Innovation Centre, 1998.
[Hummer, 2003] Wolfgang Hummer, Andreas Bauer, and Gunnar Harde. XCube: XML for Data Warehouses. In Proceedings of the 6th ACM International Workshop on Data Warehousing and OLAP (DOLAP '03), pages 33-40. ACM, New York, NY, USA, 2003. DOI=10.1145/956060.956067, http://doi.acm.org/10.1145/956060.956067
[Planets, 2010] Pauline Sinclair. The digital divide: Assessing organizations' preparations for digital preservation. Planets White Paper, March 2010.
[Rahman, 2010] Arif Ur Rahman, Gabriel David, Cristina Ribeiro. Model migration approach for database preservation. In The Role of Digital Libraries in a Time of Global Change, 12th International Conference on Asia-Pacific Digital Libraries, ICADL 2010, Gold Coast, Australia, pages 81–90. Springer Berlin / Heidelberg, 2010.
[Ramalho, 2007] José Carlos Ramalho, Miguel Ferreira, Luís Faria, Rui Castro. Relational Database Preservation through XML Modelling. In Extreme Markup Languages 2007, 2007.
[SFA, 2008] Swiss Federal Archives SFA Unit Innovation and Preservation. SIARD Format Description. Technical Report, Federal Department of Home Affairs FDHA, Berne, 2008.
[Thibodeau, 2002] Kenneth Thibodeau. Overview of technological approaches to digital preservation and challenges in coming years. In The State of Digital Preservation: An International Perspective. Documentation Abstracts, Inc. - Institutes for Information Science, 2002.