multidimensional data model
ICT@PSU 308-471 Data Warehousing and Data Mining 1 of 28
M3: Multidimensional Data Model
The only way to do great work is to love what you do. -- Steve Jobs --
Worapot Jakkhupan, PhD
worapot.j@psu.ac.th
Room BSc.0406/7
Information and Communication Technology Programme, Faculty of Science, PSU
Outline
• Review
• Dimensional Modeling
  • Fact tables
    • Dimensions
    • Facts
  • Dimension tables
    • Attributes
• OLAP operations
  • Roll-up, Drill-down
  • Slice, Dice
  • Pivot
Fact Table and Measures
• Most data in a data warehouse is stored in fact tables, which can be extremely large
• Fact data is read-only and does not change over time
• Most useful fact tables contain one or more numerical measures, or ‘facts’, that occur for each record. Measures are normally:
  • Numeric
  • Additive
• The list of dimensions defines the grain of the fact table
  • The dimensions are foreign keys (FK) that connect to the primary keys of the dimension tables
• The primary key of the fact table is the combination of the foreign keys in the fact table
  • i.e. a composite key
Example: Fact Table
Sale Fact Table
  Dimensions:
    Date_ID (fk)
    Product_ID (fk)
    Store_ID (fk)
    Customer_ID (fk)
  Facts:
    Items_sold
    Sale_value
Dimension Tables
• Dimension tables have many columns (or attributes)
  • They contain fewer rows than the fact table
• They make the data in the data warehouse usable and understandable
• The primary key is referenced by a foreign key of the fact table
• Dimension tables usually contain descriptive textual information
• Dimension attributes are used as conditions in data warehouse queries
• In a star schema, the dimension tables are de-normalized to improve query performance
Sale Fact Table
  Date_ID (fk)
  Product_ID (fk)
  Store_ID (fk)
  Customer_ID (fk)
  Items_sold
  Sale_value
Product Dimension Table
  Product_ID (pk)
  Name
  Description
  Category
  Weight
  Package type
Customer Dimension Table
  Customer_ID (pk)
  Name
  Address
  Gender
Store Dimension Table
  Store_ID (pk)
  Branch_name
  Address
  Province
  Region
Date Dimension Table
  Date_ID (pk)
  day
  month
  year
  day_of_week
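A minimal sketch of how such a star schema could be queried, using Python's built-in sqlite3 module. The table names (sale_fact, product_dim, store_dim), the reduced column lists, and all sample rows are illustrative assumptions, not data from the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, branch_name TEXT, region TEXT);
CREATE TABLE sale_fact (
    date_id     INTEGER,
    product_id  INTEGER REFERENCES product_dim(product_id),
    store_id    INTEGER REFERENCES store_dim(store_id),
    customer_id INTEGER,
    items_sold  INTEGER,
    sale_value  REAL,
    -- composite key: one row per date, product, store and customer (the grain)
    PRIMARY KEY (date_id, product_id, store_id, customer_id)
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(1, "Tent", "Outdoor"), (2, "Rice", "Food")])
conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?)",
                 [(10, "Hat Yai", "South"), (20, "Bangkok", "Central")])
conn.executemany("INSERT INTO sale_fact VALUES (?, ?, ?, ?, ?, ?)",
                 [(20240101, 1, 10, 100, 2, 300.0),
                  (20240101, 2, 10, 101, 5, 50.0),
                  (20240102, 1, 20, 100, 1, 150.0)])

# Dimension attributes (category, region) serve as conditions and group labels;
# the additive measure sale_value is summed from the fact table.
rows = conn.execute("""
    SELECT p.category, s.region, SUM(f.sale_value)
    FROM sale_fact f
    JOIN product_dim p ON p.product_id = f.product_id
    JOIN store_dim   s ON s.store_id   = f.store_id
    WHERE p.category = 'Outdoor'
    GROUP BY p.category, s.region
    ORDER BY s.region
""").fetchall()
print(rows)
```

Each dimension is reached from the fact table by a single join, which is the property that gives the star schema its name.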
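ER Aggregation
Detailed table (sale):
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4
rollup (from the detailed table to the summary table):
SELECT prodId, date, SUM(amt) AS amt FROM sale GROUP BY prodId, date
Summary table (sale):
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48
drill-down: the reverse direction, from the summary table back to the detailed table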
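The roll-up on this slide can be reproduced with a short script. This is a sketch that loads the six detailed rows shown above into an in-memory SQLite database; the use of sqlite3 and the variable names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Roll-up: remove the store dimension by grouping on product and date only.
rollup = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY prodId, date ORDER BY date, prodId"
).fetchall()

# Drill-down: go back to the finer grain behind one summary row (p1 on day 1).
detail = conn.execute(
    "SELECT storeId, amt FROM sale "
    "WHERE prodId = 'p1' AND date = 1 ORDER BY storeId"
).fetchall()

print(rollup)
print(detail)
```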
Why de-normalized?
• Imagine a table that contains a million records being joined with another table that also contains a million records
• What will happen when joining both tables?
  • The DW must perform up to 10^6 × 10^6 = 10^12 comparisons (worst case)
Sale Fact Table
  Date_ID (fk)
  Product_ID (fk)
  Store_ID (fk)
  Customer_ID (fk)
  Items_sold
  Sale_value
Product Dimension Table
  Product_ID (pk)
  Name
  Description
  Category
  Weight
  Package type
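To make the worst-case figure concrete, here is a small sketch of a naive nested-loop join that counts its comparisons. The function name and the toy rows are assumptions; real database engines normally use indexes or hash joins, so the product of the row counts is an upper bound rather than the typical cost.

```python
def nested_loop_join(left, right, key):
    """Join two lists of dicts by comparing every pair of rows."""
    comparisons = 0
    joined = []
    for l_row in left:
        for r_row in right:
            comparisons += 1
            if l_row[key] == r_row[key]:
                joined.append({**l_row, **r_row})
    return joined, comparisons

# Toy tables: 6 fact rows and 3 dimension rows.
facts = [{"product_id": i % 3, "items_sold": i} for i in range(6)]
products = [{"product_id": i, "name": f"product {i}"} for i in range(3)]

joined, comparisons = nested_loop_join(facts, products, "product_id")
print(len(joined), comparisons)  # 6 joined rows after 6 * 3 comparisons

# The same rule applied to two tables of a million rows each.
worst_case = 10**6 * 10**6
print(worst_case)
```

Storing the descriptive attributes in one wide dimension table avoids chaining several such joins for a single query.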
OLAP Tier
• Three types of OLAP servers:
• Relational OLAP (ROLAP) servers store data in relational databases and support extensions to SQL and special access methods to efficiently implement the multidimensional data model and the related operations.
• E.g., star schemas, snowflake schemas, and constellation schemas.
• Multidimensional OLAP (MOLAP) servers directly store multidimensional data in special data structures (for instance, arrays) and implement the OLAP operations over those data structures.
• While MOLAP systems offer less storage capacity than ROLAP systems, MOLAP systems provide better performance when multidimensional data is queried or aggregated.
• Hybrid OLAP (HOLAP) servers combine both technologies, benefiting from the storage capacity of ROLAP and the processing capabilities of MOLAP. For example, a HOLAP server may store large volumes of detailed data in a relational database, while aggregations are kept in a separate MOLAP store.
OLAP Terminology
Key Term | Definition
OLAP Database | The container for the different objects that are included in an analysis service solution.
Data Source | A database that provides data for an OLAP database.
Dimension | The structural building blocks of a cube.
Hierarchies | There are two types: attribute hierarchies (built using the properties of the dimension) and user-defined hierarchies (defining the way in which a cube can be sliced on a particular dimension).
Level | Identifies a position within a hierarchy to which individual items (known as members) belong.
Member | Objects within a hierarchy that represent one or more instances of fact data.
Measures | They represent quantifiable fact data in your database.
Measure groups | Used to associate dimensions with the measures from underlying fact tables, and also when a distinct count is used as the aggregation behaviour for fact data.
Cube | The primary objects created in an OLAP database. A cube has two main components: the dimensions (the structure) and the measures (referenced data).
Multidimensional Databases
• A multidimensional database is a form of database where the data is stored in cells and the position of each cell is defined by a number of hierarchical variables called dimensions.
  • Each cell represents a business event, and the values of the dimensions show when and where this event happened.
• It stores the aggregate values as well as the base values, typically in a compressed multidimensional array format, rather than in RDBMS tables. Aggregate values are pre-computed summaries of the base values.
• Multidimensional databases are typically used for business intelligence (BI), especially for online analytical processing (OLAP) and data mining (DM).
Multidimensional Databases
• Advantages of using multidimensional databases (MDBs) for OLAP and DM
  • MDBs use less disk space and have better performance. They use less disk space because the data is compressed and because they do not use indexing like a relational database (they use multidimensional offsetting to locate the data).
  • They perform better on OLAP operations because the aggregates are pre-calculated and because the way the data is physically stored (compressed multidimensional array format with offset positioning) minimizes the number of I/O operations (disk reads).
• Drawbacks
  • The first drawback is the processing time required for loading the database and calculating the aggregate values. Whenever the relational source is updated, the MDB needs to be updated or reprocessed; in other words, the aggregate cells need to be recalculated (this does not have to be done in real time).
  • The second drawback is scalability: an MDB may not scale well for a very large database (multiple terabytes) or a large number of dimensions.
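The idea of locating a cell by offset instead of by index lookup can be sketched as follows. The row-major layout, the member orderings, and the sample values are assumptions chosen for illustration; compression is left out, and actual MOLAP engines use their own storage formats.

```python
# Each dimension lists its members in a fixed order; a member's position
# in that list is its coordinate along the dimension.
products = ["p1", "p2"]
stores = ["s1", "s2", "s3"]
days = [1, 2]

# One flat array holds every cell of the cube; None marks an empty cell.
cells = [None] * (len(products) * len(stores) * len(days))

def offset(product, store, day):
    """Row-major offset of the cell (product, store, day) in the flat array."""
    p = products.index(product)
    s = stores.index(store)
    d = days.index(day)
    return (p * len(stores) + s) * len(days) + d

# Sample base values: (product, store, day, amount).
facts = [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
         ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)]
for product, store, day, amt in facts:
    cells[offset(product, store, day)] = amt

# Reading a cell needs only arithmetic on the coordinates, not a search.
print(offset("p1", "s3", 1), cells[offset("p1", "s3", 1)])
```

The flat array also shows the scalability concern: its length is the product of the dimension sizes, so it grows quickly as dimensions are added, even when most cells are empty.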
MOLAP Cube
Fact Table (RDBMS):
sale
prodId  storeId  amt
p1      s1       12
p2      s1       11
p1      s3       50
p2      s2       8
Multi-dimensional cube (dimensions = 2):
      s1   s2   s3
p1    12        50
p2    11   8
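The step from the relational fact table to the two-dimensional cube on this slide can be sketched in a few lines. The nested-list representation and the variable names are assumptions; the four rows are the ones shown on the slide.

```python
# Fact table rows from the slide: (prodId, storeId, amt).
facts = [("p1", "s1", 12), ("p2", "s1", 11), ("p1", "s3", 50), ("p2", "s2", 8)]

# The distinct members of each dimension become the cube's axes.
products = sorted({p for p, _, _ in facts})
stores = sorted({s for _, s, _ in facts})

# cube[i][j] holds the amount for products[i] at stores[j]; None = empty cell.
cube = [[None] * len(stores) for _ in products]
for p, s, amt in facts:
    cube[products.index(p)][stores.index(s)] = amt

for p, row in zip(products, cube):
    print(p, row)
```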
3-D MOLAP Cube
Fact Table (RDBMS):
sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4
Multi-dimensional cube (dimensions = 3):
day 1
      s1   s2   s3
p1    12        50
p2    11   8
day 2
      s1   s2   s3
p1    44   4
p2
3-D MOLAP Cube with Hierarchy
dimensions = 3
- Product Dimension
- Location Dimension
- Time Dimension
Detailed cube (stores s1, s2, s3 belong to City 1 and City 2 in Region 1; Day 1 and Day 2 belong to Week 1):
Day 1
      s1   s2   s3
p1    12        50
p2    11   8
Day 2
      s1   s2   s3
p1    44   4
p2
Aggregated by location (Region 1, Week 1):
           Region 1
Day 1  p1  62
       p2  19
Day 2  p1  48
       p2
Concept Hierarchies
• A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
• For example, suppose the dimension location is described by the attributes number, street, city, province or state, zipcode, and country. We then have a concept hierarchy with 4 levels: “street < city < province or state < country”.
• The attributes of a dimension may also be organized in a partial order, forming a lattice. For example, the time dimension has the partial order “day < {month < quarter; day of week} < year”.
Example hierarchies (most general level at the top):
Industry   Region    Year
Category   Country   Quarter
Product    City      Month, Week
           Office    Day
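A concept hierarchy can be represented as a chain of child-to-parent mappings, and climbing it aggregates the measure one level at a time. The sketch below assumes hypothetical cities, states, and sales figures that do not come from the slides.

```python
# One mapping per step of the hierarchy: city < province or state < country.
# All members and values are hypothetical.
city_to_state = {"Sydney": "NSW", "Newcastle": "NSW", "Hat Yai": "Songkhla"}
state_to_country = {"NSW": "Australia", "Songkhla": "Thailand"}

sales_by_city = {"Sydney": 5, "Newcastle": 2, "Hat Yai": 4}

def climb(values, mapping):
    """Aggregate values one level up the hierarchy using a child-to-parent mapping."""
    result = {}
    for member, value in values.items():
        parent = mapping[member]
        result[parent] = result.get(parent, 0) + value
    return result

sales_by_state = climb(sales_by_city, city_to_state)
sales_by_country = climb(sales_by_state, state_to_country)
print(sales_by_state)
print(sales_by_country)
```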
Lattice of Cuboids
Dimensions: item, city, year; measure: sales_in_Euro
0-D (apex) cuboid: ()
1-D cuboids: (city) (item) (year)
2-D cuboids: (city, item) (city, year) (item, year)
3-D (base) cuboid: (city, item, year)
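The lattice can be generated mechanically: every subset of the dimension set is one cuboid. A short sketch using the standard itertools module (the variable names are assumptions):

```python
from itertools import combinations

dimensions = ("city", "item", "year")

# Level k of the lattice holds every cuboid that keeps exactly k dimensions.
lattice = {k: list(combinations(dimensions, k))
           for k in range(len(dimensions) + 1)}

for k, cuboids in lattice.items():
    print(f"{k}-D:", cuboids)

# With n dimensions and no hierarchies there are 2**n cuboids in total.
total = sum(len(cuboids) for cuboids in lattice.values())
print(total)
```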
3D, 2L
• {(),
• (c1), (c2), (i1), (i2), (y1), (y2),
• (c1,i1), (c1,i2), (c2,i1), (c2,i2),
• (c1,y1), (c1,y2), (c2,y1), (c2,y2),
• (i1,y1), (i1,y2), (i2,y1), (i2,y2),
• (c1,i1,y1), (c1,i1,y2), (c1,i2,y1), (c1,i2,y2),
• (c2,i1,y1), (c2,i1,y2), (c2,i2,y1), (c2,i2,y2)}
$\prod_{i=1}^{3} (2 + 1) = 3^3 = 27$
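The count of 27 can be checked by enumeration. In this sketch each dimension takes one of its two members or a placeholder meaning "aggregated over this dimension"; the placeholder symbol `*` and the variable names are assumptions.

```python
from collections import Counter
from itertools import product

ALL = "*"  # placeholder: the dimension is aggregated away
members = {
    "city": ["c1", "c2"],
    "item": ["i1", "i2"],
    "year": ["y1", "y2"],
}

# Every dimension contributes its 2 members plus ALL, giving (2 + 1)**3 cells.
cells = list(product(*([ALL] + values for values in members.values())))
print(len(cells))

# Group the cells by how many dimensions are fixed to a concrete member.
by_fixed_dims = Counter(sum(m != ALL for m in cell) for cell in cells)
print(sorted(by_fixed_dims.items()))
```

The four groups correspond to the list above: one empty tuple, 6 single-member entries, 12 pairs, and 8 triples.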
OLAP Operations
• In a multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies.
• OLAP provides a user-friendly environment for interactive data analysis.
• A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand.
Types of OLAP Operations
• Roll up (drill-up): summarize data
• by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
• from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice:
• project and select
• Pivot (rotate):
• reorient the cube for visualization, e.g. presenting a 3D cube as a series of 2D planes.
• Other operations
• drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
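Of the operations listed, pivot is the simplest to sketch: the data stays the same and only the axes change places. The example below assumes a small product-by-store view held as nested lists, with `None` marking an empty cell.

```python
# A 2-D view with products as rows and stores as columns.
row_labels = ["p1", "p2"]
col_labels = ["s1", "s2", "s3"]
view = [
    [12, None, 50],
    [11, 8, None],
]

# Pivot (rotate): swap the two axes so that stores become the rows.
pivoted = [list(column) for column in zip(*view)]
pivoted_row_labels, pivoted_col_labels = col_labels, row_labels

for label, row in zip(pivoted_row_labels, pivoted):
    print(label, row)
```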
Drill-Down
Before (region level):
              Food Line   Outdoor Line   CATEGORY_total
Asia          59,728      151,174        210,902
After drill-down (country level):
              Food Line   Outdoor Line   CATEGORY_total
Malaysia      618         9,418          10,036
China         33,198.5    74,165         107,363.5
India         6,918       0              6,918
Japan         13,871.5    34,965         48,836.5
Singapore     5,122       32,626         37,748
Belgium       7,797.5     21,125         28,922.5
Roll-Up
Food Line Outdoor Line CATEGORY_total
Canada 29,116.5 69,310 98,426.5
Mexico 12,743.5 24,284 37,027.5
United States 102,561.5 232,679 335,240.5
Food Line Outdoor Line CATEGORY_total
North America 144,421.5 326,273 470,694.5
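The roll-up from country to region can be verified with the figures on this slide. The dictionary layout and names below are assumptions; the numbers are copied from the tables above.

```python
# Country-level values from the slide.
sales = {
    "Canada":        {"Food Line": 29116.5,  "Outdoor Line": 69310.0},
    "Mexico":        {"Food Line": 12743.5,  "Outdoor Line": 24284.0},
    "United States": {"Food Line": 102561.5, "Outdoor Line": 232679.0},
}
# One step of the location hierarchy: country -> region.
region_of = {country: "North America" for country in sales}

# Roll-up: climb from country to region, summing each category.
rolled_up = {}
for country, by_category in sales.items():
    region_totals = rolled_up.setdefault(region_of[country], {})
    for category, value in by_category.items():
        region_totals[category] = region_totals.get(category, 0.0) + value

category_total = sum(rolled_up["North America"].values())
print(rolled_up, category_total)
```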
Slice
Full cube:
               Food Line   Outdoor Line   CATEGORY_total
Asia           59,728      151,174        210,902
Europe         97,580.5    213,304        310,884.5
North America  144,421.5   326,273        470,694.5
REGION_total   301,730     690,751        992,481
Slice (region = North America):
               Food Line   Outdoor Line   CATEGORY_total
North America  144,421.5   326,273        470,694.5
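A slice fixes one dimension to a single member and returns the remaining dimensions. The sketch below stores the cube as a dictionary keyed by (region, category), which is an assumed representation; the values are the base cells from this slide.

```python
# Base cells of the cube, keyed by (region, category).
cube = {
    ("Asia", "Food Line"): 59728.0,
    ("Asia", "Outdoor Line"): 151174.0,
    ("Europe", "Food Line"): 97580.5,
    ("Europe", "Outdoor Line"): 213304.0,
    ("North America", "Food Line"): 144421.5,
    ("North America", "Outdoor Line"): 326273.0,
}

def slice_cube(cube, axis, member):
    """Keep the cells whose coordinate on `axis` equals `member`, then drop that axis."""
    return {key[:axis] + key[axis + 1:]: value
            for key, value in cube.items() if key[axis] == member}

north_america = slice_cube(cube, 0, "North America")
print(north_america)
print(sum(north_america.values()))  # matches CATEGORY_total for the slice
```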
Dice
Before:
               Food Line   Outdoor Line   CATEGORY_total
Canada         29,116.5    69,310         98,426.5
Mexico         12,743.5    24,284         37,027.5
United States  102,561.5   232,679        335,240.5
After dice (Mexico and United States; Food Line and Outdoor Line):
               Food Line   Outdoor Line
Mexico         12,743.5    24,284
United States  102,561.5   232,679
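A dice selects a sub-cube by restricting two or more dimensions to chosen sets of members. The sketch below mirrors this slide, treating the view as a dictionary keyed by (country, column); that representation and the helper name `dice` are assumptions.

```python
# The view before dicing, keyed by (country, column); values from the slide.
view = {
    ("Canada", "Food Line"): 29116.5,
    ("Canada", "Outdoor Line"): 69310.0,
    ("Canada", "CATEGORY_total"): 98426.5,
    ("Mexico", "Food Line"): 12743.5,
    ("Mexico", "Outdoor Line"): 24284.0,
    ("Mexico", "CATEGORY_total"): 37027.5,
    ("United States", "Food Line"): 102561.5,
    ("United States", "Outdoor Line"): 232679.0,
    ("United States", "CATEGORY_total"): 335240.5,
}

def dice(cube, allowed):
    """Keep the cells whose coordinate on every axis is in that axis's allowed set."""
    return {key: value for key, value in cube.items()
            if all(member in members for member, members in zip(key, allowed))}

sub_cube = dice(view, [{"Mexico", "United States"},
                       {"Food Line", "Outdoor Line"}])
for key in sorted(sub_cube):
    print(key, sub_cube[key])
```

Unlike a slice, the result keeps both dimensions; it is simply a smaller cube.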
MOLAP in Pentaho
Generate Reports
Exercise 1
Country
- Country_ID
- Country_Name
State
- State_ID
- State_Name
- Country_ID
City
- City_ID
- City_Name
- State_ID
Sale
- Time_ID
- Product_ID
- Store_ID
- Sale
- Order
Store
- Store_ID
- Store_Name
- City_ID
Time
- Time_ID
- DayOfWeek
- Month
- Quarter
- Year
Product
- Product_ID
- Product_Name
- ProductType_ID
Product_Type
- ProductType_ID
- ProductType_Name
Legend: connecting line = 1:m relationship
Sale = ( )
Order = ( )
Exercise 2
Column headers: Time:Year (2003, 2004); under each year, Product:Type (Sport cars, Classic cars); under each type, the measures Sale and Order
Row headers: Location:Country, then Location:State
Rows (values listed left to right; some cells are empty):
Australia NSW 5 2 10 5 2 2 1 5
Queensland 1 5 1 1 2 8 10 5
Victoria 8 2 1 1 10 5
USA CA 10 5 2 1 12 6 5 2
CT 4 1 5 8 1 1 15 10
NY 20 10 15 2 25 10 20 5
NJ 10 5 8 2 15 10