TRANSCRIPT
Naeem Ahmed
Department of Software Engineering Mehran University of Engineering and Technology Jamshoro
Email: [email protected]
Data Warehouse and Data Mining Lecture No. 06
Data Modeling
DW Modeling • Data Modeling
– Conceptual Modeling: • Multidimensional Entity Relationship (ME/R) Model • Multidimensional UML (mUML)
– Logical Modeling: • Cubes, Dimensions, Hierarchies
– Physical Modeling: • Star, Snowflake, Array storage
Logical Model • Goal of the Logical Model
– Confirm the subject areas – Create ‘real’ facts and dimensions from the subjects
that have been identified – Establish the needed granularity for dimensions
• Logical structure of the multidimensional model – Cubes: Sales, Purchase, Price, Inventory – Dimensions: Product, Time, Geography, Client
Logical Model
Dimensions • Dimensions are…
– entities within the data model chosen for analysis purposes • One dimension can be used to define more than one cube • They are hierarchically organized
Dimensions • Dimension hierarchies are organized in
classification levels (e.g., Day, Month, ...) – The dependencies between the classification levels are
described by the classification schema through functional dependencies
• An attribute B is functionally dependent on an attribute A, denoted A ⟶ B, if for all a ∈ dom(A) there exists exactly one b ∈ dom(B) corresponding to it
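The functional-dependency definition above can be sketched in a few lines of Python — a check that each value of A maps to exactly one value of B over a set of (a, b) pairs. The helper and data are illustrative, not from the lecture.

```python
# Sketch: check whether attribute A functionally determines attribute B
# over a set of (a, b) pairs, i.e. each a in dom(A) maps to exactly one b.
def functionally_dependent(pairs):
    """Return True if the first component determines the second (A -> B)."""
    mapping = {}
    for a, b in pairs:
        if mapping.setdefault(a, b) != b:
            return False  # same A value mapped to two different B values
    return True

# Every day belongs to exactly one month: Day -> Month holds.
day_month = [("2008-03-01", "2008-03"), ("2008-03-02", "2008-03"),
             ("2008-04-01", "2008-04")]
print(functionally_dependent(day_month))   # True

# A month contains many days, so Month -> Day does not hold.
month_day = [(m, d) for d, m in day_month]
print(functionally_dependent(month_day))   # False
```

This is exactly the asymmetry that makes Day ⟶ Month a valid edge in a classification schema while the reverse direction is not.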
Dimensions • Classification schemas
– The classification schema of a dimension D is a semi-ordered set of classification levels ({D.K0, ..., D.Kk}, ⟶ )
– With a smallest element D.K0, i.e. there is no classification level with smaller granularity
– A fully-ordered set of classification levels is called a Path • If classification schema of the time dimension is considered, then one
has the following paths – T.Day ⟶ T.Week and T.Day ⟶ T.Month ⟶ T.Quarter ⟶ T.Year
• Here T.Day is the smallest element
Dimensions • Classification hierarchies
– Let D.K0 ⟶ ...⟶ D.Kk be a path in the classification schema of dimension D
– A classification hierarchy concerning this path is a balanced tree which
• Has as nodes dom(D.K0) U...U dom(D.Kk) U {ALL} • And its edges respect the functional dependencies
Dimensions • Example: classification hierarchy from the path
product dimension
[Figure: example classification hierarchies — Store dimension: Stores → District → Region → Total; Product dimension: Products → Brand → Manufacturer → Total]
Cubes • Cubes consist of data cells with one or more
measures – A cube schema S(G,M) consists of a granularity G=
(D1.K1, ..., Dn.Kn) and a set M=(M1, ..., Mm) representing the measures
– A cube C is a set of cube cells, C ⊆ dom(G) x dom(M)
Cubes • The coordinates of a cell are the classification
nodes from dom(G) corresponding to the cell – Sales ((Article, Day, Store, Client) (Turnover)) – …
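As a minimal sketch, the Sales((Article, Day, Store, Client), (Turnover)) cube can be modeled as a mapping from coordinate tuples to measure values; the concrete article, store, and client names below are illustrative.

```python
# Sketch: a cube as a set of cells C ⊆ dom(G) x dom(M), here a Python
# mapping from coordinate tuples (the cell's coordinates) to measures.
sales = {
    # (article, day, store, client) -> turnover
    ("Jacket", "2008-03-01", "Store-1", "C-42"): 120.0,
    ("Pants",  "2008-03-01", "Store-1", "C-17"):  80.0,
    ("Jacket", "2008-03-02", "Store-2", "C-42"):  60.0,
}

# The coordinates of a cell are classification nodes from dom(G);
# the stored value comes from dom(M).
print(sales[("Jacket", "2008-03-01", "Store-1", "C-42")])   # 120.0
```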
Cubes • 4 dimensions (supplier, city, quarter, product)
Cubes • One can now imagine n-dimensional cubes
– The n-D cube is called the base cuboid – The topmost cuboid, the 0-D cuboid, which holds the highest
level of summarization, is called the apex cuboid
- The full data cube is formed by the lattice of cuboids
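The lattice of cuboids can be enumerated directly: every subset of the dimension set is one cuboid, so (ignoring hierarchies within dimensions) an n-dimensional cube has 2^n cuboids. A small sketch, using the four dimensions from the example above:

```python
# Sketch: enumerate the lattice of cuboids for n dimensions. Each cuboid
# is a subset of the dimension set; the full set is the base cuboid and
# the empty set is the apex cuboid. (Assumes no hierarchies inside the
# dimensions, so there are 2**n cuboids.)
from itertools import combinations

def cuboid_lattice(dimensions):
    return [combo for r in range(len(dimensions) + 1)
            for combo in combinations(dimensions, r)]

dims = ("supplier", "city", "quarter", "product")
lattice = cuboid_lattice(dims)
print(len(lattice))    # 16 cuboids for 4 dimensions (2**4)
print(lattice[0])      # () -- the apex cuboid
print(lattice[-1])     # ('supplier', 'city', 'quarter', 'product') -- base cuboid
```

The exponential growth of this lattice is precisely why "things can get complicated pretty fast" when materializing cuboids.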
Cubes • But things can get complicated pretty fast
Basic Operations • Basic operations of the multidimensional model on
the logical level – Selection – Projection – Cube join – Sum – Aggregation
Basic Operations • Multidimensional Selection
– The selection on a cube C((D1.K1,..., Dg.Kg), (M1, ..., Mm)) through a predicate P is defined as σP(C) = {z ∈ C : P(z)}, if all variables in P are either:
• Classification levels K which functionally depend on a classification level in the granularity, i.e. Di.Ki ⟶ K
• Measures from (M1, ..., Mm)
– E.g. σP.Prod_group=“Video”(Sales)
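A minimal sketch of this selection in Python, with an illustrative Article → Prod_group classification mapping (the articles and values are made up):

```python
# Sketch of multidimensional selection sigma_P(C): keep only the cells
# that satisfy predicate P, mirroring sigma_{Prod_group="Video"}(Sales).
sales = {
    # (article, day) -> turnover
    ("Camcorder",  "2008-03-01"): 300.0,
    ("DVD-Player", "2008-03-01"): 150.0,
    ("Jacket",     "2008-03-01"): 120.0,
}
# Classification: Article -> Prod_group (a functional dependency)
prod_group = {"Camcorder": "Video", "DVD-Player": "Video", "Jacket": "Clothes"}

def select(cube, predicate):
    return {coords: m for coords, m in cube.items() if predicate(coords, m)}

video_sales = select(sales, lambda c, m: prod_group[c[0]] == "Video")
print(sorted(video_sales))
# [('Camcorder', '2008-03-01'), ('DVD-Player', '2008-03-01')]
```

Note that the predicate is allowed to use Prod_group only because Article ⟶ Prod_group holds, exactly as the definition requires.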
Basic Operations • Multidimensional projection
– The projection of a function of a measure F(M) of cube C is defined as
πF(M)(C) = { (g,F(m)) ∈ dom(G) x dom(F(M)): (g,m) ∈ C} – E.g., the projection πturnover, sold_items(Sales)
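The projection keeps the coordinates and applies a function F to the measures. A sketch with illustrative data:

```python
# Sketch of multidimensional projection pi_{F(M)}(C): keep the
# coordinates, map the measure tuple through F.
sales = {
    # (article, day) -> (turnover, sold_items)
    ("Jacket", "2008-03-01"): (120.0, 3),
    ("Pants",  "2008-03-01"): (80.0, 4),
}

def project(cube, f):
    return {coords: f(m) for coords, m in cube.items()}

# pi_{turnover}(Sales): keep only the turnover measure
turnover_only = project(sales, lambda m: m[0])
print(turnover_only[("Jacket", "2008-03-01")])   # 120.0
```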
Basic Operations • Join operations between cubes are common
– E.g. if turnover would not be provided, it could be calculated with the help of the unit price from the price cube
• 2 cubes C1(G1, M1) and C2(G2, M2) can only be joined, if they have the same granularity (G1= G2 = G) – C1⋈C2= C(G, M1∪ M2)
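The turnover-from-unit-price example can be sketched as a cell-wise join of two cubes with the same granularity; the data below is illustrative.

```python
# Sketch of a cube join: two cubes with the same granularity G are joined
# cell-wise on their shared coordinates, concatenating the measures.
sold  = {("Jacket", "2008-03"): 3,    ("Pants", "2008-03"): 4}     # sold_items
price = {("Jacket", "2008-03"): 40.0, ("Pants", "2008-03"): 20.0}  # unit price

def join(c1, c2):
    # Same granularity G1 = G2 = G: match on the coordinate tuples.
    return {g: (c1[g], c2[g]) for g in c1.keys() & c2.keys()}

joined = join(sold, price)
# Turnover, derived from the joined measures:
turnover = {g: items * unit for g, (items, unit) in joined.items()}
print(turnover[("Jacket", "2008-03")])   # 120.0
```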
Basic Operations • When the granularities are different, but there is
still need to join the cubes, aggregation has to be performed
– E.g. , Sales ⋈ Inventory: aggregate Sales((Day,Article, Store, Client)) to Sales((Month, Article, Store, Client))
Aggregation: A whole formed or calculated by the combination of many separate units or items – Total
Basic Operations • Aggregation: most important operation for OLAP
operations – Aggregation functions
• Build a single value from a set of values, e.g., in SQL: SUM, AVG, COUNT, MIN, MAX
• Example: SUM(P.Product_group, G.City, T.Month)(Sales)
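Aggregation to a coarser granularity — e.g., rolling Sales up from Day to Month, as needed before the Sales ⋈ Inventory join above — can be sketched as a SUM over the finer cells. The data and the Day → Month mapping are illustrative.

```python
# Sketch of SUM aggregation: roll a (article, day) cube up to
# (article, month) using the Day -> Month classification mapping.
from collections import defaultdict

sales = {
    # (article, day) -> turnover
    ("Jacket", "2008-03-01"): 120.0,
    ("Jacket", "2008-03-15"):  60.0,
    ("Pants",  "2008-04-02"):  80.0,
}

def roll_up(cube, classify):
    out = defaultdict(float)
    for (article, day), turnover in cube.items():
        out[(article, classify(day))] += turnover   # SUM over finer cells
    return dict(out)

monthly = roll_up(sales, lambda day: day[:7])   # 'YYYY-MM-DD' -> 'YYYY-MM'
print(monthly[("Jacket", "2008-03")])   # 180.0
```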
Change support • Classification schema, cube schema, classification
hierarchy are all designed in the building phase and considered fixed – Practice has proven otherwise – DWs grow old, too – Changes are strongly connected to the time factor – This leads to the time validity of these concepts
• Reasons for schema modification – New requirements – Modification of the data source
Classification Hierarchy • E.g. Saturn sells a lot of electronics
– Let's consider mobile phones • They built their DW on 01.03.2003 • A classification hierarchy of their data until 01.07.2008 could
look like this:
Classification Hierarchy • After 01.07.2008 3G becomes hip and affordable
and many phone makers start migrating towards 3G capable phones – Let's say O2 makes its XDA 3G capable
Classification Hierarchy • After 01.04.2010 phone makers already develop
4G capable phones
Classification Hierarchy • It is important to trace the evolution of the data
– It can explain which data was available at which moment in time
– Such a versioning system of the classification hierarchy can be performed by constructing a validity matrix
• When is something valid? • Use timestamps to mark it!
Classification Hierarchy • Annotated Change data
Classification Hierarchy • The tree can be stored as dimension metadata
– The storage form is a validity matrix • Rows are parent nodes • Columns are child nodes
Classification Hierarchy • Deleting a node in a classification hierarchy
– Should be performed only in exceptional cases • It can lead to information loss
– How to solve it? • Soon GSM phones will not be produced anymore • But one might have some more in warehouses, to be delivered • Or one might want to query data since when GSM was sold • Just mark the end validity date of the GSM branch in the validity
matrix
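A minimal sketch of such a validity matrix: each (parent, child) edge of the classification hierarchy carries a validity interval, so "deleting" a branch only means closing its end date. The dates and nodes follow the GSM/3G example from the slides; the representation itself is an illustrative assumption.

```python
# Sketch: validity matrix for a classification hierarchy. Each edge
# (parent, child) carries a [valid_from, valid_to) interval; ISO date
# strings compare correctly as plain strings.
VALID_FOREVER = "9999-12-31"

validity = {
    # (parent, child): (valid_from, valid_to)
    ("GSM", "O2 XDA"): ("2003-03-01", "2008-07-01"),
    ("3G",  "O2 XDA"): ("2008-07-01", VALID_FOREVER),
}

def parent_of(child, date):
    """Return the parent node valid for this child at the given date."""
    for (parent, c), (start, end) in validity.items():
        if c == child and start <= date < end:
            return parent
    return None

print(parent_of("O2 XDA", "2005-01-01"))   # GSM
print(parent_of("O2 XDA", "2009-01-01"))   # 3G
```

This is also what enables the "as was" query classes discussed next: picking the hierarchy edges valid at a chosen point in time.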
Classification Hierarchy • Query classification
– Having the validity information we can support queries like as is versus as is
• Regards all the data as if the only valid classification hierarchy is the present one
• In the case of the O2 XDA, it will be treated as if it had always been a 3G phone
Classification Hierarchy • As is versus as was
– Orders the classification hierarchy by the validity matrix information
• O2 XDA was a GSM phone until 01.07.2008 and a 3G phone afterwards
Classification Hierarchy • As was versus as was
– Past time hierarchies can be reproduced
– E.g., query data with an older classification hierarchy
• Like versus like – Only data whose classification
hierarchy remained unmodified is evaluated
– E.g. the Nokia 3600 and the BlackBerry
Schema Modification • Improper modification of a schema (deleting a
dimension) can lead to – Data loss – Inconsistencies
• Data is incorrectly aggregated or adapted
• Proper schema modification is complex but – It brings flexibility for the end user
• The possibility to ask “As Is vs. As Was” queries and so on
• Alternatives – Schema evolution – Schema versioning
Schema Modification • Schema evolution
– Modifications can be performed without data loss – It involves schema modification and data adaptation to
the new schema – This data adaptation process is called Instance
adaptation
Schema Modification • Schema evolution
– Advantage • Faster to execute queries in DW with many schema
modifications
– Disadvantages • It limits the end user flexibility to query based on the past
schemas • Only actual schema based queries are supported
Schema Modification • Schema versioning
– Also no data loss – All the data corresponding to all the schemas are always
available – After a schema modification the data is held in their
belonging schema • Old data - old schema
• New data - new schema
Schema Modification • Schema versioning
– Advantages • Allows higher flexibility, e.g.,“As Is vs.As Was”, etc. queries
– Disadvantages • Adaptation of the data to the queried schema is done on the
spot • This results in longer query run time
Physical Model • Defining the physical structures
– Setting up the database environment – Performance tuning strategies
• Indexing • Partitioning • Materialization
• Goal – Define the actual storage architecture – Decide on how the data is to be accessed and how it is
arranged
Physical Model • Physical implementation of the multidimensional
paradigm model can be: – Relational
• Snowflake-schema • Star-schema • Fast constellation
– Multidimensional • Matrixes
Physical Model • Relational model, goals:
– As little loss of semantic knowledge as possible, e.g., classification hierarchies
– The translation of multidimensional queries must be efficient
– The RDBMS should be able to run the translated queries efficiently
– The maintenance of the present tables should be easy and fast e.g., when loading new data
Relational Model • Going from multidimensional to relational
– Representations for cubes, dimensions, classification hierarchies and attributes
– Implementation of cubes without the classification hierarchies is easy
• A table can be seen as a cube • A column of a table can be considered as a dimension mapping • A tuple in the table represents a cell in the cube • If one interprets only a part of the columns as dimensions, the rest can be used as measures • The resulting table is called a fact table
Relational Model
Relational Model • Snowflake-schema
– Simple idea: use a table for each classification level • This table includes the ID of the classification level and other
attributes • Two neighboring classification levels are connected by 1:n
relationships, e.g., n Days to 1 Month • The measures of a cube are maintained in a fact table • Besides measures, there are also the foreign key IDs for the
smallest classification levels
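The one-table-per-classification-level idea can be sketched with SQLite: a Day → Month hierarchy in two linked tables, and a fact table keyed by the smallest level. All table and column names below are illustrative, not from the lecture.

```python
# Sketch of a snowflake schema in SQLite: one table per classification
# level (Day -> Month), linked 1:n, plus a fact table holding the
# measure and a foreign key to the smallest classification level.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE month (month_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE day   (day_id   INTEGER PRIMARY KEY, name TEXT,
                        month_id INTEGER REFERENCES month(month_id));
    CREATE TABLE sales (day_id   INTEGER REFERENCES day(day_id),
                        turnover REAL);
    INSERT INTO month VALUES (1, '2008-03');
    INSERT INTO day VALUES (1, '2008-03-01', 1), (2, '2008-03-02', 1);
    INSERT INTO sales VALUES (1, 120.0), (2, 60.0);
""")
# Aggregating to Month requires joining along the hierarchy tables.
row = con.execute("""
    SELECT m.name, SUM(s.turnover)
    FROM sales s JOIN day d ON s.day_id = d.day_id
                 JOIN month m ON d.month_id = m.month_id
    GROUP BY m.name
""").fetchone()
print(row)   # ('2008-03', 180.0)
```

Note how even this tiny query already needs two joins — the snowflake cost that the star schema trades away below.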
Relational Model • Snowflake?
– The facts/measures are in the center – The dimensions spread out in each direction and
branch out with their granularity
Snowflake Example
Snowflake Example Advantage: Best performance when queries involve aggregation Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database
Snowflake Schema • Snowflake schema – Advantages
– With a snowflake schema the size of the dimension tables will be reduced and queries will run faster
• If a dimension is very sparse (most measures corresponding to the dimension have no data)
• And/or a dimension has long list of attributes which may be queried
• Snowflake schema – Disadvantages – Fact tables are responsible for 90% of the storage requirements
• Thus, normalizing the dimensions usually leads to insignificant improvements – Normalization of the dimension tables can reduce the performance of
the DW because it leads to a large number of tables • E.g., when connecting dimensions with coarse granularity these tables are joined with
each other during queries • A query which connects Product category with Year and Country is clearly not
performant (10 tables need to be connected)
Relational Model • Star schema
– Basic idea: use a denormalized schema for all the dimensions
• A star schema can be obtained from the snowflake schema through the denormalization of the tables belonging to a dimension
Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency
Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them
De-normalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data
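The same tiny dataset in a denormalized (star) layout makes the trade-off concrete: the Day → Month hierarchy collapses into one dimension table, repeating the month per day but needing only one join. Again, all names are illustrative.

```python
# Sketch of a star schema in SQLite: the Day -> Month hierarchy is
# denormalized into a single time dimension table, trading redundancy
# (the month repeats per day) for fewer joins.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE time_dim (day_id INTEGER PRIMARY KEY,
                           day TEXT,
                           month TEXT);  -- month repeated for every day
    CREATE TABLE sales (day_id INTEGER REFERENCES time_dim(day_id),
                        turnover REAL);
    INSERT INTO time_dim VALUES (1, '2008-03-01', '2008-03'),
                                (2, '2008-03-02', '2008-03');
    INSERT INTO sales VALUES (1, 120.0), (2, 60.0);
""")
# One join suffices for any level of the time hierarchy.
row = con.execute("""
    SELECT t.month, SUM(s.turnover)
    FROM sales s JOIN time_dim t ON s.day_id = t.day_id
    GROUP BY t.month
""").fetchone()
print(row)   # ('2008-03', 180.0)
```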
Star schema Example
Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata
Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem
Star Schema • Advantages
– Improves query performance for often-used data – Less tables and simple structure – Efficient query processing with regard to dimensions
• Disadvantages – In some cases, high overhead of redundant data
Star Schema The biggest drawback: dimension tables must carry a level indicator for every record and every query must use it. In the example, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district will be pulled from the fact table, resulting in error. Solution: FACT CONSTELLATION
Example: Select A.STORE_KEY, A.PERIOD_KEY, A.dollars from Fact_Table A
where A.STORE_KEY in (select STORE_KEY from Store_Dimension B where region = “North” and Level = 2)
Level is needed whenever aggregates are stored with detail facts.
[Figure: star schema example — a central Fact Table (STORE KEY, PRODUCT KEY, PERIOD KEY; measures Dollars, Units, Price) linked to a Store Dimension (Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr., Level), a Time Dimension (Period Desc, Year, Quarter, Month, Day, Current Flag, Resolution, Sequence), and a Product Dimension (Product Desc., Brand, Color, Size, Manufacturer, Level)]
Fact Constellation Schema • FACT Constellation Schema describes a logical
database structure of a Data Warehouse or Data Mart • It can be designed with a collection of de-normalized FACT,
Shared and Conformed Dimension tables • The FACT Constellation Schema is an extended and
decomposed STAR Schema • In Fact Constellations, aggregate tables are created
separately from the detail; therefore, it is impossible to pick up, for example, Store detail when querying the District Fact Table
Fact Constellation Schema • Fact Constellation is a good alternative to the Star, but when
dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay
• An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table, the “Snowflake Schema”
• Advantage: No need for the “Level” indicator in the dimension tables, since no aggregated data is stored with lower-level detail
• Disadvantage: Dimension tables are still very large in some cases, which can slow performance; front-end must be able to detect existence of aggregate facts, which requires more extensive metadata
Fact Constellation Example
[Figure: fact constellation example — the base Fact Table (STORE KEY, PRODUCT KEY, PERIOD KEY; Dollars, Units, Price) with its Store Dimension (Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.), Time Dimension (Period Desc, Year, Quarter, Month, Day, Current Flag, Sequence), and Product Dimension (Product Desc., Brand, Color, Size, Manufacturer), plus separate aggregate fact tables: a District Fact Table (District_ID, PRODUCT_KEY, PERIOD_KEY; Dollars, Units, Price) and a Region Fact Table (Region_ID, PRODUCT_KEY, PERIOD_KEY; Dollars, Units, Price); the dimension tables no longer carry a Level column]
Snowflake vs. Star • Snowflake
– The structure of the classifications are expressed in table schemas
– The fact table and dimension tables are normalized
• Star – The entire classification
is expressed in just one table
– The fact table is normalized, while in the dimension tables the normalization is broken
• This leads to redundancy of information in the dimension tables
Snowflake vs. Star • Snowflake • Star
Snowflake vs. Star
– Ease of maintenance/change: the star schema has redundant data and hence is less easy to maintain/change; the snowflake schema has no redundancy and hence is easier to maintain and change
– Ease of use: star queries are less complex and easy to understand; snowflake queries are more complex and hence less easy to understand
– Query performance: star has fewer foreign keys and hence shorter query execution time; snowflake has more foreign keys and hence longer query execution time
– Type of data warehouse: star is good for data marts with simple relationships (1:1 or 1:many); snowflake is good for a data warehouse core, to simplify complex relationships (many:many)
– Joins: star needs fewer joins; snowflake needs a higher number of joins
– Dimension tables: star contains a single dimension table per dimension; snowflake may have more than one dimension table per dimension
– When to use: when a dimension table contains fewer rows, go for the star schema; when a dimension table is relatively big, snowflaking is better as it reduces space
– Normalization/de-normalization: in star, both dimension and fact tables are de-normalized; in snowflake, dimension tables are normalized while the fact table is still de-normalized
– Data model: star follows a top-down approach; snowflake a bottom-up approach
Snowflake to Star • When should one go from Snowflake to star?
– Heuristics-based decision • When typical queries relate to coarser granularity (like product
category) • When the volume of data in the dimension tables is relatively
low compared to the fact table – In this case a star schema leads to negligible overhead through
redundancy, but performance is improved • When modifications on the classifications are rare compared to
insertion of fact data – In this case these modifications are controlled through the data load
process of the ETL, reducing the risk of data anomalies
Which one is the winner? – It depends on the necessity
• Fast query processing or efficient space usage – However, most of the time a mixed form is used
• The Starflake schema: some dimensions stay normalized corresponding to the snowflake schema, while others are denormalized according to the star schema
– The decision on how to deal with the dimensions is influenced by:
– Frequency of the modifications: if the dimensions change often, normalization leads to better results
– Amount of dimension elements: the bigger the dimension tables, the more space normalization saves
– Number of classification levels in a dimension: more classification levels introduce more redundancy in the star schema
– Materialization of aggregates for the dimension levels: if the aggregates are materialized, a normalization of the dimension can bring better response time
Snowflake or Star?
More Schemas • Galaxies
– In practice we usually have more measures described by different dimensions
• Thus, more fact tables
More Schemas • Fact constellations
– Pre-calculated aggregates
• Factless fact tables – Fact tables do not have non-key data
• Can be used for event tracking or to inventory the set of possible occurrences
• Factless fact table does not have any measures • For example, consider a record of student attendance in
classes. In this case, the fact table would consist of 3 dimensions: the student dimension, the time dimension, and the class dimension.
More Schemas • Factless fact tables
• This factless fact table would look like the following:
Relational Model • Relational model – disadvantages
– The representation of the multidimensional data can be implemented relationally with a finite set of transformation steps, however:
• Multidimensional queries have to be first translated to the relational representation
• A direct interaction with the relational data model is not fit for the end user
Multidimensional Model • The basic data structure for multidimensional data
storage is the array • The elementary data structures are the cubes and
the dimensions – C=((D1, ..., Dn), (M1, ..., Mm))
• The storage is intuitive as arrays of arrays, physically linearized
Multidimensional Model • Linearization example: 2D cube |D1| = 5, |D2| = 4,
cube cells = 20 – Query: Jackets sold in March? – Measure stored in cube cell D1[4], D2[3] – The 2D cube is physically stored as a linear array, so D1[4], D2[3] becomes array cell 14
• (Index(D2) − 1) * |D1| + Index(D1) • Linearized Index = (3 − 1) * 5 + 4 = 14
Linearization • Generalization:
– Given a cube C=((D1, D2, ..., Dn), (M1:Type1, M2:Type2, ..., Mm:Typem)), the index of a cube cell z with coordinates (x1, x2, ..., xn) can be linearized as follows:
• Index(z) = x1 + (x2 − 1)·|D1| + (x3 − 1)·|D1|·|D2| + ... + (xn − 1)·|D1|·...·|Dn−1| = 1 + ∑i=1..n ((xi − 1) · ∏j=1..i−1 |Dj|)
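The linearization formula translates directly into a loop that accumulates the stride |D1|·...·|Di−1| as it goes (coordinates and index are 1-based, as in the slides):

```python
# Sketch of array linearization:
# Index(z) = 1 + sum_i (x_i - 1) * prod_{j<i} |D_j|, all 1-based.
def linearize(coords, sizes):
    index, stride = 1, 1
    for x, size in zip(coords, sizes):
        index += (x - 1) * stride
        stride *= size   # stride becomes |D1| * ... * |Di|
    return index

# The 2-D example from the slides: |D1| = 5, |D2| = 4, cell (4, 3).
print(linearize((4, 3), (5, 4)))   # 14
```

Checking against the 2-D example: 1 + (4 − 1)·1 + (3 − 1)·5 = 14, matching the slide.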
Problems in Array-Storage • Influence of the order of the dimensions in the
cube definition – In the cube the cells of D2 are ordered one under the other e.g., sales of all pants involves a column in the cube – After linearization, the information is spread among more data blocks/pages – If one considers a data block can hold 5 cells, a query
over all products sold in January can be answered with just 1 block read, but a query of all sold pants, involves reading 4 blocks
Problems in Array-Storage • Solution: use caching techniques
– But...caching and swapping is performed also by the operating system
– MDBMS has to manage its caches such that the OS doesn’t perform any damaging swaps
• Storage of dense cubes – If cubes are dense, array storage is more efficient. However,
operations suffer due to the large cubes – Solution: store dense cubes not linear but on 2 levels
• The first contains indexes and the second the data cells stored in blocks
• Optimization procedures like indexes (trees, bitmaps), physical partitioning, and compression (run-length encoding) can be used
Problems in Array-Storage • Storage of sparse cubes
– All the cells of a cube, including empty ones, have to be stored
– Sparseness leads to data being stored in many physical blocks or pages
• The query speed is affected by the large number of block accesses on the secondary memory
– Solution: • Do not store empty blocks or pages but adapt the index
structure • 2 level data structure: upper layer holds all possible
combinations of the sparse dimensions, lower layer holds dense dimensions
Problems in Array-Storage • 2 level cube storage
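The 2-level storage idea can be sketched as follows: an upper index holds only the combinations of the sparse dimensions that actually occur, each pointing to a dense, linearized block over the dense dimensions. The class layout and names are illustrative assumptions, not the lecture's implementation.

```python
# Sketch of 2-level storage for a sparse cube: upper layer = index over
# occurring sparse-dimension combinations, lower layer = dense
# linearized arrays over the dense dimensions. Empty combinations
# consume no storage at all.
import math

class SparseCube:
    def __init__(self, dense_sizes):
        self.dense_sizes = dense_sizes
        self.blocks = {}   # sparse coords -> dense block (only non-empty)

    def _offset(self, dense_coords):
        # 0-based linearization over the dense dimensions
        off, stride = 0, 1
        for x, size in zip(dense_coords, self.dense_sizes):
            off += x * stride
            stride *= size
        return off

    def set(self, sparse, dense, value):
        block = self.blocks.setdefault(
            sparse, [0.0] * math.prod(self.dense_sizes))
        block[self._offset(dense)] = value

    def get(self, sparse, dense):
        block = self.blocks.get(sparse)
        return 0.0 if block is None else block[self._offset(dense)]

cube = SparseCube((5, 4))                      # dense dims: 5 x 4
cube.set(("Store-1", "C-42"), (3, 2), 120.0)   # sparse dims: store, client
print(cube.get(("Store-1", "C-42"), (3, 2)))   # 120.0
print(cube.get(("Store-9", "C-00"), (0, 0)))   # 0.0 -- no block stored
print(len(cube.blocks))                        # 1 -- only one block allocated
```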