a comparison of data warehouse design models

Upload: anidatta

Post on 03-Jun-2018

  • 8/11/2019 A Comparison of Data Warehouse Design Models

    1/105

    A COMPARISON OF DATA WAREHOUSE DESIGN MODELS

    A MASTERS THESIS

    in

    Computer Engineering

    Atilim University

    by

    BERİL PINAR BAŞARAN

    JANUARY 2005

    A COMPARISON OF DATA WAREHOUSE DESIGN MODELS

    A THESIS SUBMITTED TO

    THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES

    OF

    ATILIM UNIVERSITY

    BY

    BERİL PINAR BAŞARAN

    IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

    DEGREE OF

    MASTER OF SCIENCE

    IN

    THE DEPARTMENT OF COMPUTER ENGINEERING

    JANUARY 2005

    Approval of the Graduate School of Natural and Applied Sciences

    _____________________

    Prof. Dr. Ibrahim Akman

    Director

    I certify that this thesis satisfies all the requirements as a thesis for the degree of Master

    of Science.

    _____________________

    Prof. Dr. Ibrahim Akman

    Head of Department

    This is to certify that we have read this thesis and that in our opinion it is fully adequate, in scope and quality, as a thesis for the degree of Master of Science.

    _____________________ _____________________

    Prof. Dr. Ali Yazici Dr. Deepti Mishra

    Co-Supervisor Supervisor

    Examining Committee Members

    Prof. Dr. Ali Yazici _____________________

    Dr. Deepti Mishra _____________________

    Asst. Prof. Dr. Nergiz E. Çağıltay _____________________

    Dr. Ali Arifoğlu _____________________

    Asst. Prof. Dr. Çiğdem Turhan _____________________

    ABSTRACT

    A COMPARISON OF DATA WAREHOUSE DESIGN MODELS

    Başaran, Beril Pınar

    M.S., Computer Engineering Department

    Supervisor: Dr. Deepti Mishra

    Co-Supervisor: Prof. Dr. Ali Yazici

    January 2005, 90 pages

    There are a number of approaches to designing a data warehouse in both the conceptual and the logical design phases. The generally accepted conceptual design approaches are the dimensional fact model, the multidimensional E/R model, the starER model and the object-oriented multidimensional model. In the logical design phase, the flat schema, terraced schema, star schema, fact constellation schema, galaxy schema, snowflake schema, star cluster schema and starflake schema are widely used approaches. This thesis presents a comparison of both the conceptual and the logical design models, and provides a sample data warehouse design and implementation. It is observed that in the conceptual design phase the object-oriented model provides the best solution, and that in the logical design phase the star schema is generally the best in terms of performance and the snowflake schema is generally the best in terms of redundancy.

    Keywords: Data Warehouse, Design Methodologies, DF, starER, ME/R, OOMD,

    flat schema, terraced schema, star schema, fact constellation schema, galaxy schema,

    snowflake schema, star cluster schema, starflake schema, DTS, Data Analyzer

    ÖZ

    A COMPARISON OF DATA WAREHOUSE DESIGN MODELS

    Başaran, Beril Pınar

    M.S., Computer Engineering Department

    Supervisor: Dr. Deepti Mishra

    Co-Supervisor: Prof. Dr. Ali Yazici

    January 2005, 90 pages

    There is more than one approach to the conceptual and logical design phases of data warehouse design. The generally accepted approaches for the conceptual design phase are the dimensional fact, multidimensional E/R, starER and object-oriented multidimensional models. The generally accepted approaches for the logical design phase are the flat, terraced, star, fact constellation, galaxy, snowflake, star cluster and starflake schemas. This thesis compares the conceptual and logical design models and includes a sample data warehouse design and implementation. In this thesis, it was observed that the object-oriented multidimensional model is the best solution in the conceptual design phase, and that in the logical design phase the star schema is the best solution in terms of performance and the snowflake schema is the best in terms of data redundancy.

    Keywords: Data Warehouse, Design Methodologies, DF, starER, ME/R, OOMD, flat schema, terraced schema, star schema, fact constellation schema, galaxy schema, snowflake schema, star cluster schema, starflake schema, DTS, Data Analyzer

    To my dear husband

    Thanks for his endless support

    ACKNOWLEDGEMENTS

    First, I would like to thank my thesis advisor Dr. Deepti MISHRA and co-supervisor Prof. Dr. Ali YAZICI for their guidance, insight and encouragement throughout the study.

    I would also like to express my appreciation to the examination committee members Asst. Prof. Dr. Nergiz E. ÇAĞILTAY, Dr. Ali ARİFOĞLU and Asst. Prof. Dr. Çiğdem TURHAN for their valuable suggestions and comments.

    I would like to express my thanks to my husband for his assistance and encouragement, and to all members of my family for their patience, sympathy and support during the study.

    TABLE OF CONTENTS

    ABSTRACT .......................................................................................................................... iii

    ÖZ.......................................................................................................................................... iv

    ACKNOWLEDGEMENTS.................................................................................................. vi

    TABLE OF CONTENTS.....................................................................................................vii

    LIST OF TABLES .................................................................................................................x

    LIST OF FIGURES...............................................................................................................xi

    LIST OF ABBREVIATIONS ............................................................................................xiii

    CHAPTER

    1 INTRODUCTION.............................................................................................................. 1

    1.1. Scope and outline of the thesis...................................................................................2

    2 DATA WAREHOUSE CONCEPTS ................................................................................. 3

    2.1. Definition of Data Warehouse ...................................................................................3

    2.2. Why OLAP systems must run with OLTP................................................................ 5

    2.3. Requirements for Data Warehouse Database Management Systems......................8

    3 FUNDAMENTALS OF DATA WAREHOUSE............................................................10

    3.1. Data acquisition......................................................................................................... 12

    3.1.1. Extraction, Cleansing and Transformation Tools ............................................13

    3.2. Data Storage and Access .......................................................................................... 13

    3.3. Data Marts ................................................................................................................. 14

    4 DESIGNING A DATA WAREHOUSE.......................................................................... 16

    4.1. Beginning with Operational Data ............................................................................16

    4.2. Data/Process Models ................................................................................................ 18

    4.3. The DW Data Model ................................................................................................ 19

    4.3.1. High-Level Modeling ........................................................................................ 19

    4.3.2. Mid-Level Modeling ......................................................................................... 21

    4.3.3. Low-Level Modeling......................................................................................... 23

    4.4. Database Design Methodology for DW .................................................................. 24

    4.5. Conceptual Design Models ...................................................................................... 27

    4.5.1. The Dimensional Fact Model............................................................................ 27

    4.5.2. Multidimensional E/R Model ...........................................................................30

    4.5.3. starER ................................................................................................................. 33

    4.5.4. Object-Oriented Multidimensional Model (OOMD) ......................................35

    4.6. Logical Design Models............................................................................................. 36

    4.6.1. Dimensional Model Design .............................................................................. 37

    4.6.2. Flat Schema........................................................................................................ 39

    4.6.3. Terraced Schema ............................................................................................... 40

    4.6.4. Star Schema........................................................................................................ 41

    4.6.5. Fact Constellation Schema................................................................................ 43

    4.6.6. Galaxy Schema ..................................................................................................43

    4.6.7. Snowflake Schema ............................................................................................44

    4.6.8. Star Cluster Schema .......................................................................................... 45

    4.6.9. Starflake Schema ............................................................................................... 47

    4.6.10. Cube.................................................................................................................. 48

    4.7. Meta Data ..................................................................................................................53

    4.8. Materialized views....................................................................................................53

    4.9. OLAP Server Architectures .....................................................................................54

    5 COMPARISON OF MULTIDIMENSIONAL DESIGN MODELS.............................56

    5.1. Comparison of Dimensional Models and ER Models ............................................56

    5.2. Comparison of Dimensional Models and Object-Oriented Models ......................57

    5.3. Comparison of Conceptual Multidimensional Models........................................... 58

    5.4. Comparison of Logical Design Models...................................................................60

    5.5. Discussion on Data Warehousing Design Tools..................................................... 61

    6 IMPLEMENTING A DATA WAREHOUSE.................................................................64

    6.1. A Case Study............................................................................................................. 64

    6.2. OOMD Approach...................................................................................................... 65

    6.3. starER Approach ....................................................................................................... 68

    6.4. ME/R Approach ........................................................................................................ 70

    6.5. DF Approach ............................................................................................................. 72

    6.6. Implementation Details............................................................................................. 74

    7 CONCLUSIONS AND FUTURE WORK...................................................................... 83

    7.1. Contributions of the Thesis ...................................................................................... 85

    7.2. Future Work .............................................................................................................. 86

    REFERENCES ..................................................................................................................... 87

    LIST OF TABLES

    TABLE

    2.1 Comparison of OLTP and OLAP.................................................................................... 7

    4.1 2-dimensional pivot view of an OLAP Table .............................................................49

    4.2 3-dimensional pivot view of an OLAP Table .............................................................49

    5.1 Comparison of ER, DM and OO methodologies ......................................................... 58

    5.2 Comparison of conceptual design models....................................................................60

    5.3 Comparison of logical design models...........................................................................61

    LIST OF FIGURES

    FIGURE

    2.1 Consolidation of OLTP information ............................................................................... 4

    2.2 Same attribute with different formats in different sources............................................4

    2.3 Simple comparison of OLTP and DW systems .............................................................5

    3.1 Architecture of DW........................................................................................................ 10

    4.1 Data Extraction................................................................................................................ 16

    4.2 Data Integration .............................................................................................................. 17

    4.3 Same data, different usage............................................................................................. 17

    4.4 A Simple ERD for a manufacturing environment........................................................ 20

    4.5 Corporate ERD created by departmental ERDs........................................................... 20

    4.6 Relationship between ERD and DIS............................................................................. 21

    4.7 Midlevel model members .............................................................................................. 21

    4.8 A Midlevel model sample.............................................................................................. 22

    4.9 Corporate DIS formed by departmental DISs. .............................................................23

    4.10 An example of a departmental DIS.............................................................................23

    4.11 Considerations in low-level modeling ........................................................................ 24

    4.12 A dimensional fact schema sample............................................................................. 28

    4.13 The graphical notation of ME/R elements.................................................................. 31

    4.14 Multiple cubes sharing dimensions on different levels .............................................32

    4.15 Combining ME/R notations with E/R.........................................................................33

    4.16 Notation used in starER.............................................................................................. 33

    4.17 A sample DW model using starER .............................................................................35

    4.18 Flat Schema ................................................................................................................. 40

    4.19 Terraced Schema......................................................................................................... 41

    4.20 Star Schema................................................................................................................. 42

    4.21 Fact Constellation Schema ......................................................................................... 43

    4.22 Galaxy Schema............................................................................................................ 44

    4.23 Snowflake Schema ...................................................................................................... 45

    4.24 Star Schema with fork .............................................................................................. 46

    4.25 Star Cluster Schema ....................................................................................................47

    4.26 Starflake Schema......................................................................................................... 47

    4.27 Comparison of schemas............................................................................................... 48

    4.28 3-D Realization of a Cube ...........................................................................................50

    4.29 Operations on a Cube................................................................................................... 52

    6.1 ER model of sales and shipping systems...................................................................... 65

    6.2 Use case diagram of sales and shipping system........................................................... 66

    6.3 Statechart diagram of sales and shipping system......................................................... 67

    6.4 Static structure diagram of sales and shipping system ................................................ 67

    6.5 Sales subsystem starER model..................................................................................... 69

    6.6 Shipping subsystem starER model................................................................................ 70

    6.7 Sales subsystem ME/R model ....................................................................................... 71

    6.8 Shipping subsystem ME/R model................................................................................. 72

    6.9 Sales subsystem DF model............................................................................................73

    6.10 Shipping subsystem DF model.................................................................................... 73

    6.11 Snowflake schema for the sales subsystem............................................................... 74

    6.12 Snowflake schema for the shipping subsystem.........................................................75

    6.13 General architecture of the case study ........................................................................ 75

    6.14 Sales DTS Package ...................................................................................................... 77

    6.15 Shipping DTS Package ................................................................................................ 77

    6.16 Transformation details for delimited text file ........................................................... 78

    6.17 Transact-SQL query as the transformation source.................................................... 79

    6.18 Pivot Chart using Excel as client ............................................................................... 80

    6.19 Pivot Table using Excel as client ............................................................................... 80

    6.20 Data Analyzer as client ............................................................................................... 81

    LIST OF ABBREVIATIONS

    3GL - Third Generation Language

    4GL - Fourth Generation Language

    DAG - Directed Acyclic Graph

    DB - Database

    DBMS - Database Management Systems

    DDM - Data Dimensional Modeling

    DF - Dimensional Fact

    DIS - Data Item Set

    DSS - Decision Support System

    DTS - Data Transformation Services

    DW - Data Warehouse

    ER - Entity Relationship

    ERD - Entity Relationship Diagram

    ETL - Extract, Transform, Load

    HOLAP - Hybrid OLAP

    I/O - Input/Output

    IT - Information Technology

    ME/R - Multidimensional E/R

    MOLAP - Multidimensional OLAP

    ODBC - Open Database Connectivity

    OID - Object Identifier

    OLAP - Online Analytical Processing

    OLTP - Online Transaction Processing

    OO - Object Oriented

    OOMD - Object Oriented Multidimensional

    RDBMS - Relational Database Management Systems

    ROLAP - Relational OLAP

    SQL - Structured Query Language

    UML - Unified Modeling Language

    XML - Extensible Markup Language

    CHAPTER 1

    INTRODUCTION

    Information is an asset that provides benefit and competitive advantage to any organization. Today, nearly every corporation has a relational database management system that is used for the organization's daily operations. Companies want to increase the value of their organizational data by turning it into actionable information. As the amount of organizational data increases, it becomes harder to access it and to get the most information out of it, because it is in different formats, exists on different platforms and resides in different structures. Organizations have to write and maintain several programs to consolidate data for analysis and reporting. In addition, corporate decision-makers require access to all of the organization's data at any level, which may mean modifying existing consolidation programs or developing new ones. This process is costly, inefficient and time consuming for an organization.

    Data warehousing provides an excellent approach to transforming operational data into useful and reliable information that supports the decision making process, and it also provides the basis for data analysis techniques such as data mining and multidimensional analysis. The data warehousing process consists of extracting data from heterogeneous data sources; cleaning, filtering and transforming the data into a common structure; and storing the data in a structure that is easily accessed and used for reporting and analysis purposes.
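
    The extract, clean, filter, transform and store steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in this thesis: the two source layouts (`crm_rows`, `legacy_rows`), the field names and the `sales` table are all hypothetical.

```python
import sqlite3

# Two hypothetical sources that describe the same kind of sale differently.
crm_rows = [{"product": " p-100 ", "amount": "19.90", "date": "2004-11-03"},
            {"product": "P-200", "amount": "", "date": "2004-11-04"}]  # missing amount
legacy_rows = [("P100", 1250, "03/11/2004")]  # amount in cents, date as DD/MM/YYYY


def from_crm(row):
    """Transform a CRM record into the common (product, amount, date) layout."""
    code = row["product"].strip().upper().replace("-", "")
    amount = float(row["amount"]) if row["amount"] else None
    return (code, amount, row["date"])


def from_legacy(row):
    """Transform a legacy record into the common layout."""
    code, cents, date = row
    day, month, year = date.split("/")
    return (code, cents / 100, f"{year}-{month}-{day}")


def run_etl(conn):
    """Extract from both sources, filter incomplete rows, and store the rest."""
    rows = [from_crm(r) for r in crm_rows] + [from_legacy(r) for r in legacy_rows]
    clean = [r for r in rows if r[1] is not None]  # filtering step
    conn.execute("CREATE TABLE sales (product TEXT, amount REAL, sale_date TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
    conn.commit()
    return len(clean)


conn = sqlite3.connect(":memory:")
loaded = run_etl(conn)
# Once the data share one structure, reporting is a single query.
totals = conn.execute(
    "SELECT product, ROUND(SUM(amount), 2) FROM sales GROUP BY product").fetchall()
```

    The point of the sketch is that each source needs its own transformation, while everything after the common layout (filtering, storing, reporting) is written once.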

    Since the need for building an organizational data warehouse is clear, the question now is how to build one. There are generally accepted methodologies for designing and implementing a data warehouse. The focus of this thesis is to discuss the conceptual and logical design models for data warehouses and to compare these approaches.

    1.1. Scope and outline of the thesis

    The thesis is organized as follows: Chapter 2 presents an overview of data warehouse concepts and compares operational and analytical processing systems. Chapter 3 provides information on data warehousing fundamentals and the data warehousing process. Chapter 4 gives information on the data warehouse design approaches used in the conceptual and logical design phases. In Chapter 5, the design approaches described in Chapter 4 are discussed and compared. Finally, in Chapter 6, a sample conceptual model is logically implemented using the logical design models, and the physical implementation of a data warehouse is described.

    CHAPTER 2

    DATA WAREHOUSE CONCEPTS

    2.1. Definition of Data Warehouse

    A data warehouse (DW) is a database that is separate from the organization's Online Transaction Processing (OLTP) database and that is used for the analysis of consolidated historical data.

    According to Barry Devlin, IBM Consultant, a DW is simply "a single, complete and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context" [1, 3].

    According to W.H. Inmon, a DW is "a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process" [1, 2, 3, 6, 10, 11].

    The four key features of the DW are described below.

    Subject-oriented: In general, an enterprise holds very detailed information that meets the requirements of individual parts of the organization (sales department, human resources department, marketing department, etc.) and is optimized for transaction processing. Usually, this type of data is not suitable for decision-makers to use. Decision-makers need subject-oriented data. The DW should include only key business information. The data in the warehouse should be organized by subject, and only subject-oriented data should be moved into the warehouse.

    If a decision-maker needs to find all information about a specific product, he/she would have to query every operational system, such as the rental sales system, the order sales system and the catalog sales system, which is neither preferable nor practical. Instead, all the key information must be consolidated in a warehouse and organized into subject areas, as illustrated in Figure 2.1.

    Figure 2.1 Consolidation of OLTP information
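
    As a rough illustration of this consolidation, the sketch below groups sales held in three operational systems under a single product subject area. The three extracts, their field names and the function `consolidate_by_product` are assumptions made for the example; they do not come from the case study.

```python
from collections import defaultdict

# Hypothetical extracts from three operational systems; each records its own sales.
rental_sales = [{"product": "P100", "revenue": 40.0}]
order_sales = [{"product": "P100", "revenue": 120.0},
               {"product": "P200", "revenue": 75.0}]
catalog_sales = [{"product": "P200", "revenue": 30.0}]


def consolidate_by_product(**systems):
    """Group sales from every source system under the product subject area."""
    subject_area = defaultdict(lambda: {"total_revenue": 0.0, "by_system": {}})
    for system_name, rows in systems.items():
        for row in rows:
            entry = subject_area[row["product"]]
            entry["total_revenue"] += row["revenue"]
            entry["by_system"][system_name] = (
                entry["by_system"].get(system_name, 0.0) + row["revenue"])
    return dict(subject_area)


product_subject = consolidate_by_product(
    rental=rental_sales, order=order_sales, catalog=catalog_sales)
```

    With the data organized this way, a question about one product is answered by a single lookup such as `product_subject["P100"]` instead of three separate queries against three systems.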

    Integrated: A DW is constructed by integrating data from multiple heterogeneous sources (such as relational databases (DB), flat files, Excel sheets, XML data and data from legacy systems) to support structured and/or ad hoc queries, analytical reporting and decision making. A DW also provides mechanisms for cleaning and standardizing data. Figure 2.2 shows the various uses and formats of a Product Code attribute.

    Figure 2.2 Same attribute with different formats in different sources
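
    Standardizing such an attribute usually comes down to a small mapping function applied during loading. The sketch below is one possible form of it. The canonical format `P-000123` and the four input shapes are hypothetical, chosen only to show the idea, and are not the formats drawn in Figure 2.2.

```python
import re


def standardize_product_code(raw):
    """Map a source-specific product code to one canonical form such as 'P-000123'.

    The canonical form and the accepted input shapes are assumptions of this sketch.
    """
    digits = re.sub(r"\D", "", str(raw))  # keep only the numeric part
    if not digits:
        raise ValueError(f"no numeric product code in {raw!r}")
    return f"P-{int(digits):06d}"


# The same product as it might appear in four different sources.
source_values = [123, "000123", "prd_123", " P-123 "]
standardized = [standardize_product_code(v) for v in source_values]
```

    Rejecting values with no numeric part, rather than guessing, keeps unrecognized codes out of the warehouse so that they can be corrected at the source.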

    Time-variant: A DW provides information from a historical perspective. Every key structure in the DW contains, either implicitly or explicitly, an element of time. A DW generally stores data covering the past 5-10 years, to be used for comparisons, trends and forecasting.

    Nonvolatile: Data in the warehouse are not updated or changed in place (see Figure 2.3), so the warehouse does not require transaction processing, recovery and concurrency control mechanisms. The only operations needed in the DW are the initial loading of data, access to data and refresh.

    Figure 2.3 Simple comparison of OLTP and DW systems

    Some of the DW characteristics are given below:

    • It is a database that is maintained separately from the organization's operational databases.
    • It allows for integration of various application systems.
    • It supports information processing by consolidating historical data.
    • Its user interface is aimed at decision-makers.
    • It contains a large amount of data.
    • It is updated infrequently, but periodic updates are required to keep the warehouse meaningful and dynamic.
    • It is subject-oriented.
    • It is non-volatile.
    • Data is longer-lived. Transaction systems may retain data only until processing is complete, whereas data warehouses may retain data for years.
    • Data is stored in a format that is structured for querying and analysis.
    • Data is summarized. DWs usually do not keep as much detail as transaction-oriented systems.

    2.2. Why OLAP systems must run with OLTP

    In this section, I aim to make a comparison of OLTP and Online Analytical

    Processing (OLAP) systems and explain the reasons why an OLAP system is needed.

    OLTP and OLAP systems are completely different in nature, in terms of both technical and business needs.

    The following table compares OLTP and OLAP systems on the main technical topics.

    |  | OLTP | OLAP |
    |---|---|---|
    | User and System Orientation | Thousands of users; customer-oriented; used for transactions and querying by clerks, clients and Information Technology (IT) professionals | Hundreds of users; market-oriented; used for data analysis by knowledge workers |
    | Data Contents | Manages current data; very detail-oriented | Manages large amounts of historical data; provides facilities for summarization and aggregation; stores information at different levels of granularity to support the decision making process |
    |  | Data is continuously updated | Data is refreshed |
    |  | Data is volatile and normalized (Entity-Relationship (ER) Model) | Data is non-volatile and de-normalized (Dimensional Model) |
    | Database Design | Adopts an ER model and an application-oriented database design; index/hash on primary key | Adopts a star, snowflake or fact constellation model and a subject-oriented database design; lots of scans |

    OLTP and OLAP systems provide different functionality and therefore need to run different types of queries.
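
    The difference between the two query types can be seen in a small example. The sketch below uses Python's built-in `sqlite3` module purely for illustration. The `orders` table and its contents are hypothetical, and a single table stands in for both kinds of system only to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY, region TEXT, order_year INTEGER, amount REAL)""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "North", 2003, 100.0), (2, "North", 2004, 150.0),
     (3, "South", 2003, 80.0), (4, "South", 2004, 120.0),
     (5, "South", 2004, 50.0)])

# OLTP-style access: read or change one current record through its primary key.
conn.execute("UPDATE orders SET amount = 55.0 WHERE order_id = 5")
single_order = conn.execute(
    "SELECT region, amount FROM orders WHERE order_id = 5").fetchone()

# OLAP-style access: scan many historical rows and summarize them.
yearly_by_region = conn.execute(
    """SELECT region, order_year, SUM(amount)
       FROM orders
       GROUP BY region, order_year
       ORDER BY region, order_year""").fetchall()
```

    The first query touches one row through an index on the primary key, while the second reads every row and aggregates, which matches the "index/hash on primary key" versus "lots of scans" contrast in the table above.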

    The main roles in a company that will use a DW solution are [4]:

    • Top executives and decision makers
    • Middle/operational managers
    • Knowledge workers
    • Non-technical business related individuals

    The main advantages of using a DW solution are summarized in the list below [2, 3, 6]:

    • High query performance
    • Does not interfere with local processing at sources
    • Information copied at warehouse (can modify, summarize, restructure, etc.)
    • Potential high Return on Investment
    • Competitive advantage
    • Increased productivity of corporate decision makers

    As discussed above, a DW solution brings many advantages and benefits to an organization. However, although implementing a DW solution solves some business problems, it may bring new problems of its own, listed below [2, 6]:

    • Underestimation of resources for data loading
    • Hidden problems with source systems
    • Required data not captured
    • Increased end-user demands
    • High maintenance
    • Long duration projects
    • Complexity of integration
    • Data homogenization
    • High demand for resources
    • Data ownership

    2.3. Requirements for Data Warehouse Database Management Systems

    In the implementation of a DW solution, many technical points must be considered.

    An OLTP database management system (DBMS) needs to consider only transaction

    processing performance: a transaction must complete in minimal time, without

    deadlocks, and the system must support thousands of transactions per second.

    A DBMS used for data warehousing has to meet additional requirements.

    A relational DBMS (RDBMS) suitable for data warehousing must meet the following

    requirements [6]:

    Load performance: Data warehouses need periodic incremental loading of data,

    so load performance should be on the order of gigabytes of data per

    hour.

    Load processing: Data conversion, filtering, indexing and reformatting may be

    necessary during loading data into the data warehouse. This process should be

    executed as a single unit of work.

    Data quality management: The warehouse must ensure consistency and

    referential integrity despite various data sources and big data size. The measure

    of success for a data warehouse is the ability to satisfy business needs.

    Query Performance: Complex queries must complete in acceptable periods.

    Terabyte scalability: The data warehouse RDBMS should not have any

    database size limitations and should provide recovery mechanisms.

    Mass user scalability: The data warehouse RDBMS should be able to support

    hundreds of concurrent users.

    Warehouse administration: Easy-to-use and flexible administrative tools

    should exist for data warehouse administration.

    Advanced query functionality: The data warehouse RDBMS should supply

    advanced analytical operations to enable end-users to perform advanced

    calculations and analysis.

    CHAPTER 3

    FUNDAMENTALS OF DATA WAREHOUSE

    The main reason for building a DW is to improve the quality of information in the

    organization. Data coming from both internal and external sources in various formats

    and structures is consolidated and integrated into a single repository. A DW system

    comprises the data warehouse and all components used for building, accessing and

    maintaining the data warehouse.

    Figure 3.1 Architecture of DW

    A general architecture of a DW is given in Figure 3.1 and the main components are

    described below [5, 32].

    The data import and preparation component is responsible for data acquisition. It

    includes all programs (like Data Transformation Services (DTS)) that are responsible for

    extracting data from operational sources, preparing and loading it into the warehouse.

    The access component includes all applications (like OLAP) that use the

    information stored in the warehouse.

    Additionally, a metadata management component is responsible for the

    management, definition and access of all different types of metadata. Metadata is

    defined as data describing the meaning of data. In data warehousing, there are various

    types of metadata, e.g., information about the operational sources, the structure and

    semantics of the data warehouse data, the tasks performed during the construction, the

    maintenance and access of a data warehouse, etc.

    Implementing a DW is a complex task containing two major phases. In the

    configuration phase, a conceptual view of the warehouse is first specified according to

    user requirements (DW design). Then, the related data sources and the

    Extraction-Transform-Load (ETL) process (data acquisition) are determined. Finally,

    decisions are made about persistent storage of the warehouse using database technology

    and about the various ways data will be accessed during analysis.

    After the initial load (the first load of the DW according to the configuration),

    during the operation phase, warehouse data must be regularly refreshed, i.e.,

    modifications of operational data since the last DW refreshment must be propagated into

    the warehouse such that data stored in the data warehouse reflect the state of the

    underlying operational systems.

    A more natural way to consider multidimensionality of warehouse data is provided

    by the multidimensional data model. In this model, the data cube is the basic modeling

    construct. Operations like pivoting (rotate the cube), slicing-dicing (select a subset of the

    cube), roll-up and drill-down (increasing and decreasing the level of aggregation) can be

    applied to a data cube. For the implementation of multidimensional databases, there are

    two main approaches. In the first approach, extended RDBMSs, called relational OLAP

    (ROLAP) servers, use a relational database to implement the multidimensional model

    and operations. ROLAP servers provide SQL extensions and translate data cube

    operations to relational queries. In the second approach, multidimensional OLAP

    (MOLAP) servers store multidimensional data in non-relational specialized storage

    structures. These systems usually precompute the results of complex operations (during

    storage structure building) in order to increase performance.
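
    The data cube operations described above can be illustrated with a minimal sketch. It assumes a hypothetical three-dimensional sales cube (product, region, month) held in a plain Python dictionary; the names and figures are invented for illustration and do not come from any OLAP product.

```python
from collections import defaultdict

# Hypothetical cube: (product, region, month) -> units sold.
DIMENSIONS = ("product", "region", "month")
cube = {
    ("pen", "north", "2004-01"): 10,
    ("pen", "south", "2004-01"): 5,
    ("pen", "north", "2004-02"): 7,
    ("ink", "north", "2004-01"): 3,
    ("ink", "south", "2004-02"): 4,
}


def roll_up(cells, dims, keep):
    """Roll-up: sum the cube onto the dimensions listed in keep."""
    idx = [dims.index(d) for d in keep]
    result = defaultdict(int)
    for coords, value in cells.items():
        result[tuple(coords[i] for i in idx)] += value
    return dict(result)


def slice_cube(cells, dims, dim, value):
    """Slice: select the sub-cube where one dimension has a fixed value."""
    i = dims.index(dim)
    return {c: v for c, v in cells.items() if c[i] == value}


def pivot(cells, dims, new_order):
    """Pivot: rotate the cube by reordering its dimensions."""
    idx = [dims.index(d) for d in new_order]
    return {tuple(c[i] for i in idx): v for c, v in cells.items()}


by_product = roll_up(cube, DIMENSIONS, ("product",))
january = slice_cube(cube, DIMENSIONS, "month", "2004-01")
rotated = pivot(cube, DIMENSIONS, ("month", "region", "product"))
```

    Drill-down is the inverse of roll-up: it returns to the finer-grained cells, which is possible here only because the detailed cube is still available.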

    3.1. Data Acquisition

    Data extraction is one of the most time-consuming tasks of DW development. Data

    consolidated from heterogeneous systems may have problems, and may need to be

    transformed and cleaned before being loaded into the DW. Data gathered from operational

    systems may be incorrect, inconsistent, unreadable or incomplete. Data cleaning is an

    essential task in the data warehousing process in order to get correct, high-quality data

    into the DW. This process basically consists of the following tasks [5]:

    converting data from heterogeneous data sources with various external

    representations into a common structure suitable for the DW

    identifying and eliminating redundant or irrelevant data

    transforming data to correct values (e.g., by looking up parameter usage and

    consolidating these values into a common format)

    reconciling differences between multiple sources, due to the use of homonyms

    (same name for different things), synonyms (different names for same things) or

    different units of measurement
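
    The cleaning tasks listed above can be sketched as follows. This is only an illustration under stated assumptions: the field names, the synonym map and the choice of kilograms as the common unit are hypothetical and are not taken from any particular source system.

```python
# Hypothetical synonym map: source field name -> common warehouse name.
FIELD_SYNONYMS = {
    "cust_no": "customer_id",
    "client_id": "customer_id",
    "wt_lb": "weight",
    "weight_kg": "weight",
}
# Assumed conversion factors into the common unit (kilograms).
TO_KG = {"wt_lb": 0.45359237, "weight_kg": 1.0}


def clean_record(record):
    """Map one source record onto the common structure; None if unusable."""
    cleaned = {}
    for field, value in record.items():
        if field not in FIELD_SYNONYMS:
            continue  # irrelevant field: dropped
        if value is None or value == "":
            return None  # incomplete record: rejected
        if field in TO_KG:
            value = round(float(value) * TO_KG[field], 3)
        cleaned[FIELD_SYNONYMS[field]] = value
    return cleaned if len(cleaned) == 2 else None


def consolidate(records):
    """Clean all records and eliminate redundant (duplicate) ones."""
    seen, result = set(), []
    for record in records:
        cleaned = clean_record(record)
        if cleaned is None:
            continue
        key = (cleaned["customer_id"], cleaned["weight"])
        if key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


source_a = [{"cust_no": 1, "wt_lb": 10, "fax": "n/a"}]
source_b = [{"client_id": 1, "weight_kg": 4.536},
            {"client_id": 2, "weight_kg": ""}]
consolidated = consolidate(source_a + source_b)
```

    Both sources describe the same customer under different field names and units, so only one consolidated record remains; the incomplete record is rejected.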

    Once the cleaning process is completed, the data that will be stored in the warehouse

    must be merged and set into a common detail level containing time related information

    to enable usage of historical data. Before loading data into the DW, tasks like filtering,

    sorting, partitioning and indexing may need to be performed. After these processes, the

    consolidated data may be imported into the DW using a bulk data loader, a custom

    application or an import/export wizard provided by the DBMS administration

    applications.

    3.1.1. Extraction, Cleansing and Transformation Tools

    The tasks of capturing data from a source system, cleansing, and transforming the

    data and loading the consolidated data into a target system can be done either by

    separate products or by a single integrated solution. Integrated solutions fall into one of

    the following categories [6]:

    Code generators

    Database data replication tools

    Dynamic transformation engines

    There are solutions that fulfill all of the requirements mentioned above. One of these

    products, Microsoft Data Transformation Services, is described in chapter 6.

    Code generators

    Code generators create customized 3GL or 4GL transformation programs based on source

    and target data definitions. The main issue with this approach is the management of the

    large number of programs required to support a complex corporate DW.

    Database data replication tools

    Database data replication tools employ database triggers or a recovery log to capture

    changes to a single data source on one system and apply the changes to a copy of the

    source data located on a different system. Most replication products do not support the

    capture of changes to non-relational files and databases and often do not provide facilities

    for significant data transformation and enhancement. These tools can be used to rebuild

    a database following failure or to create a database for a data mart, provided that the

    number of data sources is small and the level of data transformation is relatively simple.

    Dynamic transformation engines

    Rule-driven dynamic transformation engines capture data from a source system at user-

    defined intervals, transform the data and then send and load the results into a target

    environment. Most products support only relational data sources, but products are now

    emerging that handle non-relational source files and databases.

    3.2. Data Storage and Access

    Because of the special nature of warehouse data and access, conventional

    mechanisms for data storage, query processing and transaction management must be

    adapted. DW solutions have complex querying requirements and perform operations

    involving large volumes of data. These operations need special access methods, storage

    structures and query processing techniques.

    The storage approaches of a DW are described in detail in section 4.9. One of these

    physical storage methods may be chosen concerning the trade-off between query

    performance and amount of data.

    Once the DW is available for end-users, there are a variety of techniques to enable

    end-users to access the DW data for analysis and reporting. There are several tools and

    products that are commercially available. Client tools generally use

    OLE DB, ODBC or native client providers to access the DW data. The most

    commonly used commercial client application is Microsoft Excel with pivot tables.

    A company that does business in several countries throughout the world may need

    to analyze regional trends and may need to compete in individual regions. A centralized

    DW may not be feasible for these companies. These organizations may need to establish

    data marts, which are selected parts of the DW that support specific decision support

    application requirements of a company's department or geographical region. Data marts usually

    contain simple replicas of warehouse partitions or data that has been further summarized

    or derived from base warehouse data. Data marts allow the efficient execution of

    predicted queries over a significantly smaller database.

    3.3. Data Marts

    A data mart is a subset of the data in a DW and is summary data relating to a

    department or a specific function [6]. Data marts focus on the requirements of users in a

    particular department or business function of an organization. Since data marts are

    specialized for departmental operations, they contain less data and the end-users are

    much more capable of exploiting data marts than DWs. The main reasons for implementing a

    data mart instead of a DW may be summarized as follows:

    Data marts enable end-users to analyze the data they need most often in their

    daily operations.

    Since data marts contain less data, query response time for end-users is much

    shorter.

    Data marts are more specialized and contain less data, therefore data

    transformation and integration tasks are much faster in data marts than DWs and

    setting up a data mart is a simpler and a cheaper task compared to establishing an

    organizational DW in terms of time and resources.

    In terms of software engineering, building a data mart may be a more feasible

    project than building a DW, because the requirements of building a data mart are

    much more explicit than a corporate wide DW project.

    Although data marts seem to have advantages over DWs, there are some issues that must

    be addressed about data marts.

    Size: Although data marts are considered to be smaller than data warehouses, size and

    complexity of some data marts may match a small corporate DW. As the size of a data

    mart increases, its performance is likely to decrease.

    Load performance: Both end-user response time and data loading performance are

    critical for data marts. To improve response time, data marts usually contain

    lots of summary tables and aggregations, which have a negative effect on load

    performance.

    User access to data in multiple data marts: A solution to this problem is building

    virtual data marts, which are views of several physical data marts.

    Administration: With the increase in number of data marts, the management need

    arises to coordinate data mart activities such as versioning, consistency, integrity,

    security and performance tuning.
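
    As an illustration of the ideas in this section, the sketch below builds two physical data marts as summarized subsets of a warehouse table and a virtual data mart as a view over them. It uses Python's built-in sqlite3 module purely for demonstration; the table names, columns and figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical warehouse fact table.
    CREATE TABLE dw_sales (
        region TEXT, department TEXT, sale_date TEXT, amount REAL
    );
    INSERT INTO dw_sales VALUES
        ('EU', 'toys',  '2004-01-05', 100.0),
        ('EU', 'toys',  '2004-01-09',  50.0),
        ('EU', 'books', '2004-01-07',  30.0),
        ('US', 'toys',  '2004-01-05',  80.0);

    -- Physical data marts: summarized subsets of the warehouse per region.
    CREATE TABLE mart_eu AS
        SELECT department, SUM(amount) AS total FROM dw_sales
        WHERE region = 'EU' GROUP BY department;
    CREATE TABLE mart_us AS
        SELECT department, SUM(amount) AS total FROM dw_sales
        WHERE region = 'US' GROUP BY department;

    -- Virtual data mart: a view over several physical data marts.
    CREATE VIEW mart_all AS
        SELECT 'EU' AS region, department, total FROM mart_eu
        UNION ALL
        SELECT 'US' AS region, department, total FROM mart_us;
""")

rows = conn.execute(
    "SELECT region, department, total FROM mart_all "
    "ORDER BY region, department"
).fetchall()
```

    The summary tables make queries against each mart cheap, but they must be rebuilt or updated on every load, which is the load performance trade-off mentioned above.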

    CHAPTER 4

    DESIGNING A DATA WAREHOUSE

    Designing a warehouse means satisfying all the requirements mentioned in

    section 2.3, and it is a complicated process.

    Building a DW has two major components: the design of the interface from

    operational systems and the design of the DW itself [11]. DW design is different from a

    classical requirements-driven systems design.

    4.1. Beginning with Operational Data

    Creating the DW does not only involve extracting operational data and entering it

    into the warehouse (Figure 4.1).

    Figure 4.1 Data Extraction

    Pulling the data into the DW without integrating it is a big mistake (Figure 4.2).

    Figure 4.2 Data Integration

    Existing applications were designed around their own requirements, with little concern

    for integration with other applications. This results in data redundancy, i.e. the

    same data may exist in other applications with the same meaning but under a different

    name or with a different measure (Figure 4.3).

    Figure 4.3 Same data, different usage

    Another problem is the performance of accessing existing systems' data. The

    existing systems environment holds gigabytes and perhaps terabytes of data, and

    attempting to scan all of it every time a DW load needs to be done is unrealistic

    in terms of both time and resources.

    Three types of data are loaded into the DW from the operational system:

    Archival data

    Data currently contained in the operational environment

    Changes to the DW environment from the updates that have occurred in the

    operational system since the last refresh

    Five common techniques are used to limit the amount of operational data scanned to

    refresh the DW.

    Scan data that has been timestamped in the operational environment.

    Scan a 'delta' file. A delta file contains only the changes made to an application

    as a result of the transactions that have run through the operational environment.

    Scan a log file or an audit file created by the transaction processing system. A

    log file contains the same data as a delta file.

    Modify application code.

    Compare a 'before' and an 'after' image of the operational file.
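
    Two of the five techniques, scanning timestamped data and comparing 'before' and 'after' images, can be sketched as below. The row layout, the updated_at column and the sample values are assumptions made for illustration only.

```python
from datetime import datetime


def scan_by_timestamp(rows, last_refresh):
    """Timestamp technique: keep only rows changed after the last refresh."""
    return [row for row in rows if row["updated_at"] > last_refresh]


def compare_images(before, after):
    """Before/after technique: derive changes from two snapshots keyed by id."""
    inserted = {k: v for k, v in after.items() if k not in before}
    deleted = {k: v for k, v in before.items() if k not in after}
    updated = {k: v for k, v in after.items()
               if k in before and before[k] != v}
    return inserted, updated, deleted


# Hypothetical operational rows with an assumed updated_at column.
orders = [
    {"id": 1, "total": 20, "updated_at": datetime(2004, 12, 30)},
    {"id": 2, "total": 35, "updated_at": datetime(2005, 1, 3)},
]
changed = scan_by_timestamp(orders, last_refresh=datetime(2005, 1, 1))

before_image = {1: {"total": 20}, 2: {"total": 30}}
after_image = {2: {"total": 35}, 3: {"total": 10}}
inserted, updated, deleted = compare_images(before_image, after_image)
```

    The timestamp scan touches only rows that carry a recent stamp, whereas the image comparison needs two full copies of the file, which is why it tends to be the most expensive of the five options.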

    Another difficulty is that operational data must undergo a time-basis shift as it

    passes into the DW. Operational data is accurate only as of the instant it is accessed;

    after that it may be updated. However, once the data is loaded into the warehouse, it

    cannot be updated anymore, so a time element must be attached to it.
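
    A minimal sketch of this time-basis shift is given below, assuming a hypothetical account record: each load copies the operational row and adds the snapshot date to its key, so earlier states are kept rather than overwritten.

```python
from datetime import date


def take_snapshot(operational_row, key_field, snapshot_date):
    """Copy an operational row and add the snapshot date to its key."""
    record = dict(operational_row)
    record["snapshot_date"] = snapshot_date
    dw_key = (operational_row[key_field], snapshot_date)
    return dw_key, record


warehouse = {}
account = {"account_id": 7, "balance": 100}  # hypothetical operational row

key, record = take_snapshot(account, "account_id", date(2004, 1, 31))
warehouse[key] = record

account["balance"] = 250  # the operational system updates in place

key, record = take_snapshot(account, "account_id", date(2004, 2, 29))
warehouse[key] = record
```

    After the second load the warehouse holds both balances, each valid as of its own snapshot date, while the operational system holds only the current one.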

    Another problem when passing data is the need to manage the volume of data that

    resides in and passes into the warehouse. The volume of data in the DW will grow fast.

    4.2. Data/Process Models

    The process model applies only to the operational environment. The data model

    applies to both the operational environment and the DW environment.

    A process model consists of:

    Functional decomposition

    Context-level zero diagram

    Data flow diagram

    Structure chart

    State transition diagram

    Hierarchical input process output (HIPO) chart

    Pseudo code

    A process model is invaluable, for instance, when building the data mart. The

    process model is requirements-based; it is not suitable for the DW.

    The data model is applicable to both the existing systems environment and the DW

    environment. An overall corporate data model is constructed with no regard for a

    distinction between existing operational systems and the DW. The corporate data model

    focuses only on primitive data. Performance factors are added into the corporate data

    model as the model is transported to the existing systems environment. While few

    changes are made to the corporate data model for the operational environment, more

    changes are needed before it can be used in the DW environment. First, data that

    is used purely in the operational environment is removed. Next, the key structures of the

    corporate data model are enhanced with an element of time. Derived data is added to the

    corporate data model where the derived data is publicly used and calculated once, not

    repeatedly. Finally, data relationships in the operational environment are turned into

    artifacts in the DW. A final design activity in transforming the corporate data model to

    the data warehouse data model is to perform stability analysis. Stability analysis

    involves grouping attributes of data together based on their tendency for change.
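
    Stability analysis can be pictured with a small sketch. The attributes and their change frequencies below are assumed for a hypothetical customer entity; in practice they would come from observing the operational data.

```python
from collections import defaultdict

# Hypothetical attributes with an assumed tendency for change.
attribute_volatility = {
    "date_of_birth": "never",
    "gender": "never",
    "address": "sometimes",
    "phone": "sometimes",
    "account_balance": "often",
}


def group_by_stability(volatility):
    """Group attributes that share the same tendency for change."""
    groups = defaultdict(list)
    for attribute, frequency in volatility.items():
        groups[frequency].append(attribute)
    return {freq: sorted(attrs) for freq, attrs in groups.items()}


tables = group_by_stability(attribute_volatility)
```

    Each resulting group is a candidate for its own table, so that a change to a frequently updated attribute does not force a new copy of the rarely changing ones.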

    4.3. The DW Data Model

    There are three levels in the data modeling process: high-level modeling (called the

    ERD, entity relationship level), midlevel modeling (called the data item set, or DIS), and

    low-level modeling (called the physical model).

    4.3.1. High-Level Modeling

    The high level of modeling features entities and relationships. The name of the

    entity is surrounded by an oval. Relationships among entities are depicted with arrows.

    The direction and number of the arrowheads indicate the cardinality of the relationship,

    and only direct relationships are indicated.

    Figure 4.4 A Simple ERD for a manufacturing environment

    The entities that are shown in the ERD level (see Figure 4.4) are at the highest level

    of abstraction.

    The corporate ERD as shown in Figure 4.5 is formed of many individual ERDs that

    reflect the different views of people across the corporation. Separate high-level data

    models have been created for different communities within the corporation. Collectively,

    they make up the corporate ERD.

    Figure 4.5 Corporate ERD created by departmental ERDs

    4.3.2. Mid-Level Modeling

    After the high-level data model is created, the next level is established: the

    midlevel model, or DIS. For each major subject area, or entity, identified in the

    high-level data model, a midlevel model is created. Each area is subsequently developed

    into its own midlevel model (see Figure 4.6).

    Figure 4.6 Relationship between ERD and DIS

    Four basic constructs are found at the midlevel model (also shown in Figure 4.7):

    A primary grouping of data

    A secondary grouping of data

    A connector, suggesting the relationships of data between major subject areas

    Type of data

    Figure 4.7 Midlevel model members

    Figure 4.9 Corporate DIS formed by departmental DISs.

    Figure 4.10 shows an individual department's DIS.

    Figure 4.10 An example of a departmental DIS

    4.3.3. Low-Level Modeling

    The physical data model is created from the midlevel data model just by extending

    the midlevel data model to include keys and physical characteristics of the model. At

    this point, the physical data model looks like a series of tables, sometimes called

    relational tables. With the DW, the first step in doing so is deciding on the granularity

    and partitioning of the data.

    After granularity and partitioning are factored in, a variety of other physical design

    activities are embedded into the design. At the heart of the physical design

    considerations is the usage of physical input/output (I/O). Physical I/O is the activity that

    brings data into the computer from storage or sends data to storage from the computer.

    The job of the DW designer is to organize data physically for the return of the

    maximum number of records from the execution of a physical I/O. Figure 4.11 illustrates

    the major considerations in low-level modeling.

    Figure 4.11 Considerations in low-level modeling

    There is another mitigating factor regarding physical placement of data in the data

    warehouse: Data in the warehouse normally is not updated. This frees the designer to use

    physical design techniques that otherwise would not be acceptable if it were regularly

    updated.
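
    The goal of returning as many records as possible per physical I/O can be made concrete with a small calculation. The block size and record sizes below are assumed values chosen only to show the effect; real figures depend on the DBMS and the hardware.

```python
import math


def ios_needed(num_records, record_bytes, block_bytes=8192):
    """Physical I/Os needed to read all records, one block per I/O."""
    records_per_block = block_bytes // record_bytes
    return math.ceil(num_records / records_per_block)


# Assumed sizes: one million records, wide versus compact layout.
wide_layout = ios_needed(1_000_000, record_bytes=400)
compact_layout = ios_needed(1_000_000, record_bytes=100)
```

    Under these assumptions the compact layout needs roughly a quarter of the I/Os, which is the kind of gain that physical design techniques aim for when data is not updated after loading.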

    4.4. Database Design Methodology for DW

    In the next few sections of this thesis I will be discussing both conceptual and

    logical design methods of data warehousing. Adopting the terminology of [23, 36, 37,

    38], three different design phases are distinguished: conceptual design manages concepts

    that are close to the way users perceive data; logical design deals with concepts related

    to a certain kind of DBMS; physical design depends on the specific DBMS and

    describes how data is actually stored [35, 40].

    Before beginning the discussion, the basic concepts of dimensional modeling

    should be introduced: facts, dimensions and measures [7, 24].

    A fact is a collection of related data items, consisting of measures and context

    data. It typically represents business items or business transactions.

    A dimension is a collection of data that describe one business dimension.

    Dimensions determine the contextual background for the facts; they are the

    parameters over which we want to perform OLAP.

    A measure is a numeric attribute of a fact, representing the performance or

    behavior of the business relative to the dimensions.
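
    The three concepts can be illustrated with a small sketch. The sales fact, its two dimensions and all values are hypothetical; the keys in the fact are the context data and the numeric fields are the measures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductDim:  # dimension: describes the product context
    product_key: int
    name: str
    category: str


@dataclass(frozen=True)
class DateDim:  # dimension: describes the time context
    date_key: int
    year: int
    month: int


@dataclass(frozen=True)
class SalesFact:  # fact: context keys plus measures
    product_key: int
    date_key: int
    quantity: int    # measure
    revenue: float   # measure


def total_by(facts, dimension_rows, key_name, group_attr, measure):
    """Sum a measure over the facts, grouped by a dimension attribute."""
    lookup = {getattr(d, key_name): getattr(d, group_attr)
              for d in dimension_rows}
    totals = {}
    for fact in facts:
        group = lookup[getattr(fact, key_name)]
        totals[group] = totals.get(group, 0) + getattr(fact, measure)
    return totals


products = [ProductDim(1, "pen", "stationery"),
            ProductDim(2, "ink", "stationery"),
            ProductDim(3, "lamp", "furniture")]
dates = [DateDim(20040101, 2004, 1), DateDim(20040201, 2004, 2)]
facts = [SalesFact(1, 20040101, 10, 20.0),
         SalesFact(2, 20040101, 3, 15.0),
         SalesFact(3, 20040201, 1, 40.0)]

revenue_by_category = total_by(facts, products, "product_key",
                               "category", "revenue")
quantity_by_month = total_by(facts, dates, "date_key", "month", "quantity")
```

    The same facts are analyzed along two different dimensions without being changed, which is what is meant by dimensions being the parameters over which OLAP is performed.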

    Before this discussion, I also summarize the methodology proposed by

    Kimball [21], who is widely regarded as an authority on data warehousing and whose

    studies have encouraged many academics to study data warehousing.

    The nine-step methodology by Kimball is as follows [6, 42, 43]:

    1. Choosing the process: The process (function) refers to the subject matter of a

    particular data mart. The first data mart to be built should be the one that is most

    likely to be delivered on time and within budget, and to answer the most important

    business question.

    2. Choosing the grain: This means deciding exactly what a fact table record

    represents. Only when the grain for the fact table is chosen can we identify the

    dimensions of the fact table. The grain decision for the fact table also determines

    the grain of each of the dimension tables.

    3. Identifying and conforming the dimensions: Dimensions set the context for

    asking questions about the facts in the fact table. A well-built set of dimensions

    makes the data mart understandable and easy to use. A poorly presented or

    incomplete set of dimensions will reduce the usefulness of a data mart to an

    enterprise. When a dimension is used in more than one data mart, the dimension

    is referred to as being conformed.

    4. Choosing the facts: The grain of the fact table determines which facts can be

    used in the data mart. All the facts must be expressed at the level implied by the

    grain. The facts should be numeric and additive. Additional facts can be added to

    a fact table at any time provided they are consistent with the grain of the table.

    5. Storing pre-calculations in the fact table: Once the facts have been selected, each

    should be re-examined to determine whether there are opportunities to use

    pre-calculations.

    6. Rounding out the dimension tables: We return to the dimension tables and add

    as many text descriptions to the dimensions as possible. The text descriptions should be

    as intuitive and understandable to the users as possible. The usefulness of a data mart is

    determined by the scope and nature of the attributes of the dimension tables.

    7. Choosing the duration of the database: The duration measures how far back in

    time the fact table goes. There is often a requirement to look at the same time period a

    year or two earlier. Very large fact tables raise at least two very significant DW

    design issues. First, it is often increasingly difficult to source increasingly old

    data. The older the data, the more likely there will be problems in reading and

    interpreting the old files or the old tapes. Second, it is mandatory that the old

    versions of the important dimensions be used, not the most current versions. This

    is known as the slowly changing dimension problem.

    8. Tracking slowly changing dimensions: There are three basic types of slowly

    changing dimensions:

    o Type 1: where a changed dimension attribute is overwritten,

    o Type 2: where a changed dimension attribute causes a new dimension

    record to be created,

    o Type 3: where a changed dimension attribute causes an alternate attribute to be

    created so that both the old and new values of the attribute are

    simultaneously accessible in the same dimension record.

    9. Deciding the query priorities and the query modes: We consider physical design

    issues. The most critical physical design issues affecting the end-user's

    perception of the data mart are the physical sort order of the fact table on disk and

    the presence of pre-stored summaries or aggregations. There are additional

    physical design issues affecting administration, backup, indexing performance,

    and security.

    At the end of these steps, we have a design for a data mart that supports the

    requirements of a particular business process and also allows easy integration with

    other related data marts to ultimately form the enterprise-wide DW.
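
    The three types of slowly changing dimensions listed in step 8 can be sketched as follows. The row layout (surrogate_key, natural_key, is_current) and the previous_ prefix for the alternate attribute are assumptions made for this illustration, not part of the methodology itself.

```python
def scd_type1(dim_rows, natural_key, attr, new_value):
    """Type 1: overwrite the attribute; the old value is lost."""
    for row in dim_rows:
        if row["natural_key"] == natural_key:
            row[attr] = new_value


def scd_type2(dim_rows, natural_key, attr, new_value):
    """Type 2: keep the old record and create a new dimension record."""
    current = next(r for r in dim_rows
                   if r["natural_key"] == natural_key and r["is_current"])
    current["is_current"] = False
    new_row = dict(current, is_current=True)
    new_row[attr] = new_value
    new_row["surrogate_key"] = max(r["surrogate_key"] for r in dim_rows) + 1
    dim_rows.append(new_row)


def scd_type3(dim_rows, natural_key, attr, new_value):
    """Type 3: keep the old value in an alternate attribute of the same record."""
    for row in dim_rows:
        if row["natural_key"] == natural_key:
            row["previous_" + attr] = row[attr]
            row[attr] = new_value


def fresh_dimension():
    # Hypothetical customer dimension with a single record.
    return [{"surrogate_key": 1, "natural_key": "C42",
             "city": "Ankara", "is_current": True}]


type1 = fresh_dimension()
scd_type1(type1, "C42", "city", "Izmir")

type2 = fresh_dimension()
scd_type2(type2, "C42", "city", "Izmir")

type3 = fresh_dimension()
scd_type3(type3, "C42", "city", "Izmir")
```

    Only Type 2 preserves the full history, at the cost of extra dimension records; Type 3 keeps exactly one previous value, and Type 1 keeps none.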

    4.5. Conceptual Design Models

    The main goal of conceptual design modeling is developing a formal, complete,

    abstract design based on the user requirements [34].

    At this phase of a DW there is the need to:

    Represent facts and their properties: Fact properties are usually numerical and

    can be summarized (aggregated).

    Connect the dimension to facts: Time is always associated with a fact.

    Represent objects and capture their properties with the associations among them:

    Object properties (summary properties) can be numeric. Additionally there are

    three special types of associations; specialization/generalization (showing objects

    as subclasses of other objects), aggregation (showing objects as parts of a layer

    object), membership (showing that an object is a member of another higher

    object class with the same characteristics and behavior). Membership may be strict (or

    not) (all members belong to only one higher object class) and complete

    (or not) (all members belong to one higher object class and that object class

    consists of those members only).

    Record the associations between objects and facts: Facts are connected to

    objects.

    Distinguish dimensions and categorize them into hierarchies: dimensions are

    governed by associations of type membership, forming hierarchies that specify

    different granularities.

    4.5.1. The Dimensional Fact Model

    This model is built from ER schemas [9, 15, 16, 17, 33]. The Dimensional Fact

    (DF) Model is a collection of tree structured fact schemas whose elements are facts,

    attributes, dimensions and hierarchies. The additivity of fact attributes and the

    existence of optional dimension attributes and non-dimension attributes may also be

    represented on fact

    schemas. Compatible fact schemas may be overlapped in order to relate and compare

    data.

    A fact schema is structured as a tree whose root is a fact. The fact is represented by

    a box which reports the fact name.

    Figure 4.12 A dimensional fact schema sample

    Sub-trees rooted in dimensions are hierarchies. The circles represent the attributes

    and the arcs represent relationships between attribute pairs. The non-dimension attributes

    (address attribute as shown in Figure 4.12) are represented by lines instead of circles. A

    non-dimension attribute contains additional information about an attribute of the

    hierarchy, is connected to it by a -to-one relationship and cannot be used for

    aggregation. The arcs represented by dashes express optional relationships between pairs

    of attributes.

    A fact expresses a many-to-many relationship among the dimensions. Each

    combination of values of the dimensions defines a fact instance, one value for each fact

    attribute. Most attributes are additive along all dimensions. This means that the sum

    operator can be used to aggregate attribute values along all hierarchies. A fact attribute is

    called semi-additive if it is not additive along one or more dimensions, non-additive if it

    is additive along no dimension.
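
    The distinction between additive, semi-additive and non-additive fact attributes can be illustrated with a small sketch. The fact instances below are hypothetical: quantity sold is taken as additive, stock level as semi-additive (it cannot be summed along time) and unit price as non-additive.

```python
# Hypothetical fact instances: (product, month) -> fact attribute values.
facts = {
    ("pen", "2004-01"): {"quantity_sold": 10, "stock_level": 40, "unit_price": 2.0},
    ("pen", "2004-02"): {"quantity_sold": 7, "stock_level": 33, "unit_price": 2.0},
    ("ink", "2004-01"): {"quantity_sold": 3, "stock_level": 12, "unit_price": 5.0},
    ("ink", "2004-02"): {"quantity_sold": 4, "stock_level": 8, "unit_price": 5.0},
}


def aggregate(fact_instances, attr, fixed_month=None, op=sum):
    """Aggregate one attribute, optionally restricted to a single month."""
    values = [v[attr] for (_, month), v in fact_instances.items()
              if fixed_month is None or month == fixed_month]
    return op(values)


def average(values):
    return sum(values) / len(values)


# Additive: summing along product and time is meaningful.
total_sold = aggregate(facts, "quantity_sold")

# Semi-additive: sum along product only, for one fixed point in time.
stock_in_february = aggregate(facts, "stock_level", fixed_month="2004-02")

# Non-additive: a sum is meaningless, so another operator is used.
average_price = aggregate(facts, "unit_price", op=average)
```

    Summing stock_level over both months would count the same goods twice, which is why a semi-additive attribute needs a different operator, such as the average or the last value, along the time dimension.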

    The construction of a DF model consists of 5 steps:

    Defining facts (a fact may be represented on the E/R schema either by an entity F

    or by an n-ary relationship between entities E1 to En).

    For each fact:

    o Building the attribute tree. (Each vertex corresponds to an attribute of the

    schema; the root corresponds to the identifier of F; for each vertex v, the

    corresponding attribute functionally determines all the attributes

    corresponding to the descendants of v. If F is identified by the

    combination of two or more attributes, identifier (F) denotes their

    concatenation. It is worth adding some further notes: It is useful to

    emphasize on the fact schema the existence of optional relationships

    between attributes in a hierarchy. Optional relationships or optional

    attributes of the E/R schema should be marked by a dash; A one-to-one

    relationship can be thought of as a particular kind of many-to-one

    relationship, hence, it can be inserted into the attribute tree;

    Generalization hierarchies in the E/R schema are equivalent to one-to-one

    relationships between the super-entity and each sub-entity; x-to-many

    relationships cannot be inserted into the attribute tree. In fact,

    representing these relationships at the logical level, for instance by a star

    schema, would be impossible without violating the first normal form; an

    n-ary relationship is equivalent to n binary relationships. Most n-ary

    relationships have maximum multiplicity greater than 1 on all their

    branches; they determine n one-to-many binary relationships which

    cannot be inserted into the attribute tree.)

    o Pruning and grafting the attribute tree (Not all of the attributes

    represented in the attribute tree are interesting for the DW. Thus, the

    attribute tree may be pruned and grafted in order to eliminate the

    unnecessary levels of detail. Pruning is carried out by dropping any sub-

    tree from the tree. The attributes dropped will not be included in the fact

    schema, hence, it will be impossible to use them to aggregate data.

    Grafting is used when a vertex must be dropped but its descendants must be preserved.).

    o Defining dimensions (The dimensions must be chosen in the attribute tree

    among the children vertices of the root. E/R schemas can be classified as

    snapshot and temporal. A snapshot schema describes the current state of

    the application domain; old versions of data varying over time are

    continuously replaced by new versions. A temporal schema describes the

    evolution of the application domain over a range of time; old versions of

    data are explicitly represented and stored. When designing a DW from a

    temporal schema, time is explicitly represented as an E/R attribute and

    thus it is an obvious candidate to define a dimension. In a snapshot schema, time is

    not explicitly represented; however, it should be added as a dimension to the

    fact schema).

    o Defining fact attributes (Fact attributes are typically either counts of the

    number of instances of F, or the sum/average/maximum/minimum of

    expressions involving numerical attributes of the attribute tree. A fact

    may have no attributes, if the only information to be recorded is the

    occurrence of the fact.).

    o Defining hierarchies (Along each hierarchy, attributes must be arranged

    into a tree such that an x-to-one relationship holds between each node and

    its descendants. It is still possible to prune and graft the tree in order to

    eliminate irrelevant details. It is also possible to add new levels of

    aggregation by defining ranges for numerical attributes. During this

    phase, the attributes which should not be used for aggregation but only

    for informative purposes may be identified as non-dimension attributes.).
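
    The pruning and grafting step can be sketched on a toy attribute tree. The SALE fact and
    all attribute names below are hypothetical, and storing the tree as a parent-to-children
    dictionary is an assumption made for illustration only. Pruning drops a vertex with its
    whole sub-tree, whereas grafting drops only the vertex and attaches its children to its
    parent.

```python
# Hypothetical attribute tree for a SALE fact, stored as parent -> children.
# The root stands for the identifier of the fact; every vertex functionally
# determines its descendants.
tree = {
    "sale_id": ["product", "store", "date", "receipt_no"],
    "product": ["category"],
    "category": ["department"],
    "store": ["city"],
    "city": ["region"],
}


def prune(tree, vertex):
    """Drop `vertex` together with its whole sub-tree."""
    for child in tree.pop(vertex, []):
        prune(tree, child)
    for children in tree.values():
        if vertex in children:
            children.remove(vertex)


def graft(tree, vertex):
    """Drop `vertex` but keep its descendants by attaching them to its parent."""
    orphans = tree.pop(vertex, [])
    for children in tree.values():
        if vertex in children:
            position = children.index(vertex)
            children[position:position + 1] = orphans
            return


prune(tree, "receipt_no")  # not interesting for analysis
graft(tree, "city")        # city is dropped, region is kept below store

# Dimensions are then chosen among the children of the root.
dimensions = tree["sale_id"]
```

    After these two calls, region hangs directly below store, and the candidate dimensions are
    product, store and date.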

    4.5.2. Multidimensional E/R Model

    It is argued that the ER approach is not suited for multidimensional conceptual

    modeling because the semantics of the main characteristics of the model cannot be

    effectively represented.

    Multidimensional E/R (ME/R) model includes some key considerations [14]:

    Specialization of the ER Model

    Minimal extension of the ER Model; this model should be easy to learn and use

    for an experienced ER Modeler. There are few additional elements.

    Representation of the multidimensional aspects; despite the minimality, the

    specialization should be powerful enough to express the basic multidimensional

    aspects, namely the qualifying and quantifying data and the hierarchical structure

    of the qualifying data.

    This model allows the generalization concepts. There are some specializations:

    A special entity set: dimension level

    Two special relationship sets connecting dimension levels:

    o a special n-ary relationship set: the fact relationship set

    By modeling the multidimensional cube as a relationship set it is possible to

    include an arbitrary number of facts in the schema thus representing a multi-

    cube model. Remarkably the schema also contains information about the

    granularity level on which the dimensions are shared.

    Concerning measures and their structure, the ME/R model allows record

    structured measures as multiple attributes for one fact relationship set. The

    semantic information that some of the measures are derived cannot be included

    in the model. Like the E/R model the ME/R model captures the static structure of

    the application domain. The calculation of measures is functional information

    and should not be included in the static model. An orthogonal functional model

    should capture these dependencies.

    The schema contains rolls-up relationships between entities. Therefore levels of

    different dimensions may roll up to a common parent level. This information can

    be used to avoid redundancies.

    This model uses the is-a relationship.

    ME/R and ER model notations can be used together.

    Figure 4.14 shows multiple cubes that share dimensions on different levels.

    Figure 4.14 Multiple cubes sharing dimensions on different levels

    As mentioned above, the ME/R and ER model notations can be used together as

    illustrated in Figure 4.15.

    Figure 4.15 Combining ME/R notations with E/R

    4.5.3. starER

    This model combines star structure with constructs of ER model [13]. The starER

    contains facts, entities, relationships and attributes. This model has the following

    constructs:

    Fact set: represents a set of real world facts sharing the same characteristics or

    properties. It is always associated with time. It is represented as a circle.

    Entity set: represents a set of real world objects with similar properties. It is

    represented as a rectangle.

    Relationship set: represents a set of associations among entity sets or among

    entity sets and fact sets. Its cardinality can be many-to-many, many-to-one, one-

    to-many. It is represented as a diamond. Relationship sets among entity sets can

    be of type specialization/generalization, aggregation and membership. Figure

    4.16 shows the notation for relationship set types.

    Figure 4.16 Notation used in starER

    Attribute: static properties of entity sets, relationship sets, fact sets. It is

    represented as an oval.

    Fact properties can be of type stock (S) (the state of something at a specific point

    in time), flow (F) (the cumulative effect over a period of time for some

    parameter in the DW environment and which is always summarized) or value-

    per-unit (V) (measured for a fixed-time and the resulted measures are not

    summarized).

    The following criteria are satisfied by the starER schema;

    Explicit hierarchies in dimensions

    Symmetric treatment of dimensions and summary attributes (properties)

    Multiple hierarchies in each dimension

    Support for correct summary or aggregation

    Support of non-strict hierarchies

    Support of many-to-many relationships between facts and dimensions

    Handling different levels of granularity at summary properties

    Handling uncertainty

    Handling change and time

    The following list shows the main differences between the DF Schema and the starER model;

    Relationships between dimensions and facts in starER are not only many-to-one,

    but also many-to-many, which allows for better understanding of the involved

    information.

    Objects participating in the data warehouse, but not in the form of a dimension, are

    allowed in starER.

    Specialized relationships on dimensions are permitted

    (specialization/generalization, aggregation, membership) and represent more

    information.

    DF requires only a rather straightforward transformation to fact and dimension

    tables. This is an advantage of DF Schema. But this is not a drawback for the

    starER model, since well-known rules of how to transform an ER Schema

    (which is the basic structural difference between the two approaches) to relations

    do exist.

    (base class). An association of classes specifies the relationships between two levels of a

    classification hierarchy. These classes must define DAG (Directed Acyclic Graph)

    rooted in the dimension class. The DAG structure can represent both alternative path and

    multiple classification hierarchies. A descriptor attribute ({D}) is defined in every class that

    represents a classification hierarchy level. Strictness means that an object at a

    hierarchy's lower level belongs to only one higher level object. Completeness means

    that all members belong to one higher-class object and that object consists of those

    members only. OOMD approach uses a generalization-specialization relationship to

    categorize entities that contain subtypes.
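
    The two hierarchy properties can be checked mechanically. The sketch below is an
    illustration only: the level pairs and member names are hypothetical, and completeness is
    reduced to the simplified reading that every lower-level object is assigned to some
    higher-level object.

```python
def is_strict(parents_of):
    """Strict: no lower-level object belongs to more than one higher-level object."""
    return all(len(parents) <= 1 for parents in parents_of.values())


def is_complete(parents_of):
    """Complete (simplified): every lower-level object has a higher-level object."""
    return all(len(parents) >= 1 for parents in parents_of.values())


# Hypothetical City -> Region level pair: strict and complete.
city_to_region = {"Ankara": {"Central"}, "Izmir": {"Aegean"}}

# Hypothetical Product -> Category level pair: one product is classified
# twice (non-strict) and one product is not classified at all (incomplete).
product_to_category = {
    "Camera phone": {"Phones", "Cameras"},
    "Tablet": {"Computers"},
    "Gift card": set(),
}

city_checks = (is_strict(city_to_region), is_complete(city_to_region))
product_checks = (is_strict(product_to_category), is_complete(product_to_category))
```

    Checks of this kind matter for aggregation, because a non-strict hierarchy counts a
    member more than once when measures are rolled up to the higher level.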

    Cube classes represent initial user requirements as the starting point for the subsequent

    data-analysis phase. Cube classes contain;

    Head area; contains the cube class's name.

    Measures area; contains the measures to be analyzed.

    Slice area; contains the constraints to be satisfied.

    Dice area; contains the dimensions and their grouping conditions to address the

    analysis.

    Cube operations; cover the OLAP operations for a further data analysis phase.

    4.6. Logical Design Models

    DW logical design involves the definition of structures that enable an efficient

    access to information. The designer builds multidimensional structures considering the

    conceptual schema representing the information requirements, the source databases, and

    non-functional (mainly performance) requirements. This phase also includes

    specifications for data extraction tools, data loading processes, and warehouse access

    methods. At the end of logical design phase, a working prototype should be created for

    the end-user.

    Dimensional models represent data with a cube structure, making the logical data

    representation more compatible with OLAP data management. The objectives of

    dimensional modeling are [10]:

    To produce database structures that are easy for end-users to understand and

    write queries against,

    To maximize the efficiency of queries.

    It achieves these objectives by minimizing the number of tables and relationships

    between them. Normalized databases have some characteristics that are appropriate for

    OLTP systems, but not for DWs [7]:

    Their structure is not easy for end-users to understand and use. In OLTP systems

    this is not a problem because, usually end-users interact with the database

    through a layer of software.

    Data redundancy is minimized. This maximizes efficiency of updates, but tends

    to penalize retrievals. Data redundancy is not a problem in DWs because data is

    not updated on-line.

    Dimensional modeling uses ER modeling with some important restrictions.

    A dimensional model is composed of one table with a composite primary key, called the fact

    table, and a set of smaller tables called dimension tables. Each dimension table has a

    simple (non-composite) primary key that corresponds exactly to one of the components

    of the composite key in the fact table. This characteristic structure is called star schema

    or star join.

    Another important feature is that all natural keys are replaced with surrogate keys. This

    means that every join between fact and dimension tables is based on surrogate keys, not

    natural keys. Each surrogate key should have a generalized structure based on simple

    integers. The use of surrogate keys allows the data in the DW to have some

    independence from the data used and produced by the OLTP systems.
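
    A minimal sketch of surrogate key assignment follows. The class name, the customer codes
    and the order amounts are hypothetical, and a real load process would also persist the
    mapping between loads. Each natural key from the OLTP source receives a simple integer the
    first time it is seen, and fact rows then refer only to that integer.

```python
import itertools


class SurrogateKeyGenerator:
    """Maps natural keys from a source system to simple integer surrogate keys."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._keys = {}

    def key_for(self, natural_key):
        """Return the existing surrogate key, or assign the next integer."""
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._next_key)
        return self._keys[natural_key]


# Hypothetical OLTP order rows: (natural customer key, order amount).
oltp_orders = [
    ("CUST-TR-0042", 120.0),
    ("CUST-TR-0007", 80.0),
    ("CUST-TR-0042", 35.5),
]

customer_keys = SurrogateKeyGenerator()

# Fact rows carry the surrogate key instead of the natural key.
fact_rows = [(customer_keys.key_for(code), amount) for code, amount in oltp_orders]
```

    If the source system later changes the format of its customer codes, only the mapping has
    to be updated; the joins between fact and dimension tables are unaffected.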

    4.6.1. Dimensional Model Design

    This section describes a method for developing a dimensional model from an Entity

    Relationship model [12].

    The ER data model is used by OLTP systems. It contains no redundancy, which gives high

    update efficiency, and shows all data and the relationships between them. However, even

    simple queries require multiple table joins and complex subqueries, so it is suitable mainly

    for technical specialists.

    Classify Entities: For producing a dimensional model from ER model, first

    classify the entities into three categories.

    o Transaction Entities: These entities are the most important entities in a

    DW. They have highest precedence. They construct fact tables in star

    schema. These entities record details about particular events (orders,

    payments, etc.) that decision makers want to understand and analyze.

    There are some characteristics;

    It describes an event that occurs at a point in time.

    It contains measurements or quantities that may be summarized

    (sales amount, volumes)

    o Component Entities: These entities are directly related with a transaction

    entity with a one-to-many relationship. They have lowest precedence.

    They define the details or components of each transaction. They answer

    the who, what, when, where, how and why of an event

    (customer, product, period, etc.). Time is an important component of any

    transaction. They construct dimension tables in star schema.

    o Classification Entities: These entities are related with component entities

    by a chain of one-to-many relationships. They are functionally dependent

    on a component entity. These entities represent hierarchies embedded in

    the data model, which may be collapsed into component entities to form

    dimension tables in star schema.

    Identify Hierarchies: Most dimension tables in star schema include embedded

    hierarchies. A hierarchy is called maximal if it cannot be extended upwards or

    downwards by including another entity. An entity is called minimal if it has no

    one-to-many relationship. An entity is called maximal if it has no many-to-one

    relationship.

    Produce Dimensional Models: There are two operators to produce dimensional

    models from ER.

    o Collapse Hierarchy: Higher level entities can be collapsed into lower

    level entities within hierarchies. Collapsing a hierarchy is a form of

    denormalization. This increases redundancy in the form of a transitive

    dependency, which is a violation of 3NF. We can continue doing this

    until we reach the bottom of the hierarchy and end up with a single table.

    o Aggregation: This operator can be applied to a transaction entity to create

    a new entity containing summarized data.
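
    The two operators can be sketched on in-memory rows. The Region, Customer and Sale
    entities, their attributes and the helper names `collapse` and `aggregate` are hypothetical
    and serve only to illustrate the idea. Collapsing copies the attributes of the higher level
    entity into each lower level row, and aggregation derives a summarized entity from a
    transaction entity.

```python
from collections import defaultdict

# Hypothetical hierarchy: Region (higher level) -> Customer (lower level).
regions = {"R1": {"region_name": "North"}, "R2": {"region_name": "South"}}
customers = [
    {"customer_id": "C1", "customer_name": "Acme", "region_id": "R1"},
    {"customer_id": "C2", "customer_name": "Bolt", "region_id": "R1"},
    {"customer_id": "C3", "customer_name": "Core", "region_id": "R2"},
]


def collapse(lower_rows, higher_rows, foreign_key):
    """Copy the attributes of the higher level entity into each lower level row."""
    return [{**row, **higher_rows[row[foreign_key]]} for row in lower_rows]


# region_name now depends on customer_id only through region_id, which is
# the transitive dependency (and the redundancy) introduced by collapsing.
customer_dimension = collapse(customers, regions, "region_id")

# Hypothetical transaction entity.
sales = [
    {"customer_id": "C1", "amount": 100},
    {"customer_id": "C1", "amount": 50},
    {"customer_id": "C3", "amount": 70},
]


def aggregate(rows, group_by, measure):
    """Create summarized data by totalling `measure` per `group_by` value."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_by]] += row[measure]
    return dict(totals)


sales_by_customer = aggregate(sales, "customer_id", "amount")
```

    In the collapsed dimension the value North is stored once per customer of that region,
    which is the price paid for avoiding a join.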

    There are 8 models used in dimensional modeling [6, 12]:

    Flat Schema

    Terraced Schema

    Star Schema

    Fact Constellation Schema

    Galaxy Schema

    Snowflake Schema

    Star Cluster Schema

    Starflake Schema

    4.6.2. Flat Schema

    This schema is the simplest schema. This is formed by collapsing all entities in the

    data model down into the minimal entities. This minimizes the number of tables in the

    database and joins in the queries. We end up with one table for each minimal entity in

    the original data model [12].

    This structure does not lose information from the original data model. It contains

    redundancy, in the form of transitive and partial dependencies, but does not involve any

    aggregation. It has two problems. First, it may lead to aggregation errors when

    there are hierarchical relationships between transaction entities: when we collapse

    numerical amounts from higher level transaction entities into another entity, they will be

    repeated. Second, this schema contains a large number of attributes.

    Therefore while the number of tables (system complexity) is minimized, the

    complexity of each table (element complexity) is increased. Figure 4.18 shows a sample

    flat schema.
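
    The aggregation error can be reproduced with a few rows. In the sketch below, Order is a
    higher level transaction entity and Order Item a lower level one; both entities and the
    shipping fee attribute are hypothetical. Once the order is collapsed into its items, the
    fee appears once per item, and a plain sum over the flat table overstates it.

```python
# Hypothetical higher level transaction entity.
orders = {"O1": {"shipping_fee": 10}, "O2": {"shipping_fee": 5}}

# Hypothetical lower level transaction entity.
order_items = [
    {"order_id": "O1", "product": "pen", "amount": 3},
    {"order_id": "O1", "product": "ink", "amount": 7},
    {"order_id": "O2", "product": "pad", "amount": 4},
]

# Flat schema: the order is collapsed into each of its items, so the
# shipping fee of order O1 is now stored twice.
flat_rows = [{**item, **orders[item["order_id"]]} for item in order_items]

true_total_fee = sum(order["shipping_fee"] for order in orders.values())
flat_total_fee = sum(row["shipping_fee"] for row in flat_rows)
```

    Here the true total is 15, while the flat table yields 25 because the fee of order O1 is
    counted for both of its items.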

    Figure 4.18 Flat Schema

    4.6.3. Terraced Schema

    This schema is formed by collapsing entities down maximal hierarchies, stopping

    when they reach a transaction entity. This results in a single table for each transaction

    entity in the data model. It causes some problems for inexperienced users, because the

    separation between levels of transaction entities is explicitly shown [12]. Figure

    4.19 illustrates a sample terraced schema.

    Figure 4.19 Terraced Schema

    4.6.4. Star Schema

    It is the basic structure for a dimensional model. It has one fact table and a set of

    smaller dimension tables arranged around the fact table. The fact data will not change

    over time. The most useful facts are numeric and additive because data warehouse

    applications almost never access a single record. They access hundreds, thousands,

    millions of records at a time and aggregate them. The fact table is linked to all the

    dimension tables by one-to-many relationships. It contains measurements which may be

    aggregated in various ways [10, 12, 39].

    Dimension tables contain descriptive textual information. Dimension attributes are

    used as the constraints in the data warehouse queries. Dimension tables provide the basis

    for aggregating the measurements in the fact table. They generally consist of embedded

    hierarchies.

    Each star schema is formed in the following way;

    A fact table is formed for each transaction entity. The key of the table is the

    combination of the keys of its associated component entities.

    A dimension table is formed for each component entity, by collapsing

    hierarchically related classification entities into it.

    Where hierarchical relationships exist between transaction entities, the child

    entity inherits all dimensions (and key attributes) from the parent entity. This

    provides the ability to drill down between transaction levels.

    Numerical attributes within transaction entities should be aggregated by key

    attributes (dimensions). The aggregation attributes and functions used depend on

    the application.
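
    The procedure above can be sketched with an in-memory SQLite database. All table names,
    column names and rows are hypothetical. The fact table key combines the keys of its two
    dimensions, each dimension table already contains its collapsed classification attribute,
    and the query aggregates the measure by dimension attributes.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE customer_dim (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT,
        region_name   TEXT              -- collapsed from a Region entity
    );
    CREATE TABLE product_dim (
        product_key   INTEGER PRIMARY KEY,
        product_name  TEXT,
        category_name TEXT              -- collapsed from a Category entity
    );
    CREATE TABLE sales_fact (
        customer_key INTEGER REFERENCES customer_dim (customer_key),
        product_key  INTEGER REFERENCES product_dim (product_key),
        amount       REAL,
        PRIMARY KEY (customer_key, product_key)
    );
    INSERT INTO customer_dim VALUES (1, 'Acme', 'North'), (2, 'Bolt', 'South');
    INSERT INTO product_dim VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Lighting');
    INSERT INTO sales_fact VALUES (1, 1, 30.0), (1, 2, 70.0), (2, 1, 20.0);
""")

# One join per dimension is enough to aggregate by any dimension attribute.
rows = connection.execute("""
    SELECT c.region_name, p.category_name, SUM(f.amount)
    FROM sales_fact AS f
    JOIN customer_dim AS c ON c.customer_key = f.customer_key
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY c.region_name, p.category_name
    ORDER BY c.region_name, p.category_name
""").fetchall()
```

    Because region and category are stored inside the dimension tables, the query needs no
    additional joins to reach them.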

    Star schemas can be used to speed up query performance by denormalizing

    reference information into a single dimension table. Denormalization is appropriate

    when there are a number of entities related to the dimension table that are often

    accessed, avoiding the overhead of having to join additional tables to access those

    attributes. Denormalization is not appropriate where the additional data is not accessed

    very often, because the overhead of scanning the expanded dimension table may not be

    offset by gain in the query performance.

    The advantage of this schema is that it reduces the number of tables in the database

    and the number of relationships between them and also the number of joins required in

    user queries. Figure 4.20 shows a sample star schema.

    Figure 4.20 Star Schema

    4.6.5. Fact Constellation Schema

    A fact constellation schema consists of a set of star schemas with hierarchically

    linked fact tables. The links between the various fact tables provide the ability to drill

    down between levels of detail [10, 12]. The following figure, Figure 4.21, illustrates a

    sample of a fact constellation schema.

    Figure 4.21 Fact Constellation Schema

    4.6.6. Galaxy Schema

    Galaxy schema is a schema where multiple fact tables share dimension tables.

    Unlike a fact constellation schema, the fact tables in a galaxy do not need to be directly

    related [12]. The following figure, Figure 4.22, illustrates a sample of a galaxy schema.

    Figure 4.22 Galaxy Schema

    4.6.7. Snowflake Schema

    In a star schema, hierarchies in the original data model are collapsed or

    denormalized to form dimension tables. Each dimension table may contain multiple

    independent hierarchies. A snowflake schema is a variant of star schema in which all

    hierarchies are explicitly shown and dimension tables do not contain denormalized data [10,

    12].

    The many-to-one relationships among sets of attributes of a dimension can be separated into

    new dimension tables, forming a hierarchy. The decomposed snowflake structure

    visualizes the hierarchical structure of dimensions very well.

    A snowflake schema can be produced by the following procedure:

    A fact table is formed for each transaction entity. The key of the table is the

    combination of the keys of the associated component entities.

    Each component entity becomes a dimension table.

    Where hierarchical relationships exist between transaction entities, the child

    entity inherits all relationships to component entities (and key attributes) from

    the parent entity.

    Numerical attributes within transaction entities should be aggregated by the key

    attributes. The attributes and functions used depend on the application.
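
    The separation of many-to-one relationships into their own tables can be sketched as
    follows. The product dimension, its levels and the helper name `snowflake` are hypothetical,
    and each resulting table is reduced to a mapping from a member to its parent at the next
    level.

```python
# Hypothetical collapsed (star) dimension with an embedded hierarchy
# product -> category -> department.
product_dim = [
    {"product": "Pen", "category": "Stationery", "department": "Office"},
    {"product": "Pad", "category": "Stationery", "department": "Office"},
    {"product": "Lamp", "category": "Lighting", "department": "Home"},
]


def snowflake(rows, levels):
    """Split a collapsed dimension into one table per hierarchy level.

    Each table maps a member to its parent at the next level, or to None
    for the top level.
    """
    tables = {}
    for position, level in enumerate(levels):
        parent = levels[position + 1] if position + 1 < len(levels) else None
        tables[level] = {
            row[level]: (row[parent] if parent else None) for row in rows
        }
    return tables


tables = snowflake(product_dim, ["product", "category", "department"])
```

    Each category and department is now stored once, at the cost of one extra join per level
    when a query needs the higher level attributes.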

    The following figure, Figure 4.23, illustrates a sample of a snowflake schema.

    Figure 4.23 Snowflake Schema

    4.6.8. Star Cluster Schema

    While snowflake contains fully expanded hierarchies, which adds complexity to the

    schema and requires extra joins, star schema contains fully collapsed hierarchies, which

    leads to redundancy. So, the best solution may be a balance between these two schemas

    [12]. Overlapping dimensions can be identified as forks in hierarchies. A fork occurs

    when an entity acts as a parent in two different dimensional hierarchies. Fork entities can

    be identified as classification entities with multiple one-to-many relationships. In Figure

    4.24, Region is parent of both Location and Customer entities and the fork occurs at the

    Region entity.
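
    Fork entities can be found directly from the one-to-many relationships of the data model.
    The sketch below reuses the Region, Location and Customer entities of the example; the
    Sale, Product and Category entities and the relationship list itself are assumptions added
    to complete the illustration.

```python
from collections import defaultdict

# One-to-many relationships as (parent, child) pairs. Sale, Product and
# Category are hypothetical additions to the Region/Location/Customer example.
one_to_many = [
    ("Region", "Location"),
    ("Region", "Customer"),
    ("Location", "Sale"),
    ("Customer", "Sale"),
    ("Category", "Product"),
    ("Product", "Sale"),
]
transaction_entities = {"Sale"}

children = defaultdict(set)
for parent, child in one_to_many:
    children[parent].add(child)

# Component entities are directly related to a transaction entity.
component_entities = {
    entity for entity, kids in children.items() if kids & transaction_entities
}

# The remaining parents are classification entities.
classification_entities = set(children) - component_entities - transaction_entities

# A fork is a classification entity with more than one one-to-many relationship.
fork_entities = {
    entity for entity in classification_entities if len(children[entity]) > 1
}
```

    With these relationships the only fork is Region, since it is the parent of both Location
    and Customer, while Category has a single child.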

    Figure 4.24 Star Schema with fork

    A star cluster schema is a star schema which is selectively snowflaked to separate

    out hierarchical segments or sub dimensions which are shared between different

    dimensions.

    A star cluster schema has the minimal number of tables while avoiding overlap

    between dimensions.

    A star cluster schema can be produced by the following procedure:

    A fact table is formed for each transaction entity. The key of the table is the

    combination of the keys of the associated component entities.

    Classification entities should be collapsed down their hierarchies until they reach

    either a fork entity or a component entity. If a fork is reached, a sub dimension

    table should be formed. The sub dimension table will consist of the fork entity

    plus all its ancestors. Collapsing should begin again after the fork entity. When a

    component entity is reached, a dimension table should be formed.

    Where hierarchical relationships exist between transaction entities, the child

    entity should inherit all dimensions (and key attributes) from the parent entity.