multidimensional data model


Upload: worapot-jakkhupan

Post on 12-Feb-2017

Page 1: Multidimensional Data Model

ICT@PSU 308-471 Data Warehousing and Data Mining 1 of 28

M3: Multidimensional Data Model

The only way to do great work is to love what you do. -- Steve Jobs --

WORAPOT JAKKHUPAN, PHD

WORAPOT.J@PSU.AC.TH

ROOM BSC.0406/7

Information and Communication Technology Programme, Faculty of Science, PSU

Page 2: Multidimensional Data Model

Outline

• Review

• Dimensional Modeling
  • Fact tables
    • Dimensions
    • Facts
  • Dimension tables
    • Attributes

• OLAP operations
  • Roll-up, Drill-down
  • Slice, Dice
  • Pivot

Page 3: Multidimensional Data Model

Fact Table and Measures

• Most data in a data warehouse is stored in fact tables, which can be extremely large

• Fact data is read-only and does not change over time

• Most useful fact tables contain one or more numerical measures, or ‘facts’, that occur for each record. Measures are normally:
  • Numeric
  • Additive

• The list of dimensions defines the grain of the fact table
  • The dimensions are foreign keys (FK) that connect to the primary keys of the dimension tables

• The primary key of the fact table is the combination of the foreign keys in the fact table
  • i.e., a composite key

Page 4: Multidimensional Data Model

Example: Fact Table

Sale Fact Table

Dimensions:
  Date_ID (fk)
  Product_ID (fk)
  Store_ID (fk)
  Customer_ID (fk)

Facts:
  Items_sold
  Sale_value

Page 5: Multidimensional Data Model

Dimension Tables

• Dimension tables have many columns (or attributes)
  • They contain fewer rows than the fact table

• They make the data in the data warehouse usable and understandable

• The primary key of a dimension table is referenced by a foreign key of the fact table

• Dimension tables usually contain descriptive textual information

• Dimension attributes are used as conditions in data warehouse queries

• In a star schema, dimension tables are de-normalized to improve query performance

Page 6: Multidimensional Data Model

Sale Fact Table: Date_ID (fk), Product_ID (fk), Store_ID (fk), Customer_ID (fk), Items_sold, Sale_value

Product Dimension Table: Product_ID (pk), Name, Description, Category, Weight, Package type

Customer Dimension Table: Customer_ID (pk), Name, Address, Gender

Store Dimension Table: Store_ID (pk), Branch_name, Address, Province, Region

Date Dimension Table: Date_ID (pk), day, month, year, day_of_week
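A star schema like this one can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: it keeps two of the four dimensions, and the sample rows (product names, branch names, values) are made up for the example and are not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store_dim   (store_id INTEGER PRIMARY KEY, branch_name TEXT, region TEXT);
CREATE TABLE sale_fact (
    product_id INTEGER REFERENCES product_dim(product_id),
    store_id   INTEGER REFERENCES store_dim(store_id),
    items_sold INTEGER,
    sale_value REAL,
    PRIMARY KEY (product_id, store_id)  -- composite key built from the foreign keys
);
-- Hypothetical sample data, for illustration only.
INSERT INTO product_dim VALUES (1, 'Tent', 'Outdoor'), (2, 'Rice', 'Food');
INSERT INTO store_dim VALUES (1, 'Hat Yai', 'South'), (2, 'Bangkok', 'Central');
INSERT INTO sale_fact VALUES (1, 1, 2, 200.0), (2, 1, 5, 50.0), (1, 2, 1, 100.0);
""")

# Dimension attributes (category, region) act as the grouping conditions,
# while the measure (sale_value) comes from the fact table.
rows = conn.execute("""
    SELECT p.category, s.region, SUM(f.sale_value)
    FROM sale_fact f
    JOIN product_dim p ON p.product_id = f.product_id
    JOIN store_dim s ON s.store_id = f.store_id
    GROUP BY p.category, s.region
    ORDER BY p.category, s.region
""").fetchall()
print(rows)
```

The fact table only holds keys and measures; every descriptive attribute used in the query lives in a dimension table.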

Page 7: Multidimensional Data Model

ER Aggregation

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

rollup: SELECT prodId, date, SUM(amt) FROM sale GROUP BY date, prodId

sale
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

drill-down: the reverse direction, from the aggregated table back to the detailed table
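The roll-up on this slide can be reproduced with Python's sqlite3 module. The sketch below loads the six detailed rows from the slide and runs a variant of the slide's query that also selects prodId, so that the output has the same three columns as the aggregated table; the ORDER BY clause is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
        ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
    ],
)

# Roll up: drop the store dimension by summing amt over all stores.
rolled_up = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()
print(rolled_up)  # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
```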

Page 8: Multidimensional Data Model

Why de-normalized?

• Imagine a table that contains a million records being joined with another table that also contains a million records

• What will happen when joining both tables?
  • The DW must compare 10^6 × 10^6 = 10^12 pairs of rows (worst case)

Sale Fact Table: Date_ID (fk), Product_ID (fk), Store_ID (fk), Customer_ID (fk), Items_sold, Sale_value

Product Dimension Table: Product_ID (pk), Name, Description, Category, Weight, Package type
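The worst case described here corresponds to a naive nested-loop join, which compares every row of one table with every row of the other. The sketch below counts those comparisons on small, made-up tables (1,000 fact rows and 100 dimension rows); real database engines usually avoid this cost with indexes or hash joins, so it illustrates the worst case only.

```python
def nested_loop_join(fact_rows, dim_rows):
    """Join on product_id by comparing every fact row with every dimension row."""
    comparisons = 0
    joined = []
    for product_id, amount in fact_rows:
        for dim_id, name in dim_rows:
            comparisons += 1
            if product_id == dim_id:
                joined.append((product_id, name, amount))
    return joined, comparisons


# Hypothetical data: 100 products, 1,000 sales that each reference one product.
dim_rows = [(i, f"product-{i}") for i in range(100)]
fact_rows = [(i % 100, 1.0) for i in range(1000)]

joined, comparisons = nested_loop_join(fact_rows, dim_rows)
print(len(joined), comparisons)  # 1000 100000

# The same arithmetic for two tables of a million rows each.
print(10**6 * 10**6)  # 1000000000000
```

The number of comparisons is the product of the two table sizes, which is why fewer and smaller joins (a de-normalized dimension table) help query performance.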

Page 9: Multidimensional Data Model

OLAP Tier

• Three types of OLAP servers:

• Relational OLAP (ROLAP) servers store data in relational databases and support extensions to SQL and special access methods to efficiently implement the multidimensional data model and the related operations.

• E.g., star schemas, snowflake schemas, and constellation schemas.

• Multidimensional OLAP (MOLAP) servers directly store multidimensional data in special data structures (for instance, arrays) and implement the OLAP operations over those data structures.

• While MOLAP systems offer less storage capacity than ROLAP systems, MOLAP systems provide better performance when multidimensional data is queried or aggregated.

• Hybrid OLAP (HOLAP) servers combine both technologies, benefiting from the storage capacity of ROLAP and the processing capabilities of MOLAP. For example, a HOLAP server may store large volumes of detailed data in a relational database, while aggregations are kept in a separate MOLAP store.

Page 10: Multidimensional Data Model

OLAP Terminology

Key Term: Definition

OLAP Database: The container for the different objects that are included in an analysis service solution.

Data Source: A database that provides data for an OLAP database.

Dimension: The structural building blocks of a cube.

Hierarchies: There are two types: attribute hierarchies (built using the properties of the dimension) and user-defined hierarchies (defining the method in which a cube can be sliced on a particular dimension).

Level: Identifies a position within a hierarchy to which individual items (known as members) belong.

Member: Objects within a hierarchy that represent one or more instances of fact data.

Measures: They represent quantifiable fact data in your database.

Measure groups: Used to associate dimensions with the measures from underlying fact tables, and also used when a distinct count is the aggregation behaviour for fact data.

Cube: The primary object created in an OLAP database. A cube has two main components: the dimensions (the structure) and the measures (the referenced data).

Page 11: Multidimensional Data Model

Multidimensional Databases

• A multidimensional database is a form of database where the data is stored in cells and the position of each cell is defined by a number of hierarchical variables called dimensions.
  • Each cell represents a business event, and the values of the dimensions show when and where this event happened.

• It stores the aggregate values as well as the base values, typically in compressed multidimensional array format, rather than in RDBMS tables. Aggregate values are pre-computed summaries of the base values.

• Multidimensional databases are typically used for business intelligence (BI), especially for online analytical processing (OLAP) and data mining (DM).

Page 12: Multidimensional Data Model

Multidimensional Databases

• Advantages of using multidimensional databases for OLAP and DM

• It uses less disk space and performs better because the data is compressed and because it does not use indexing like a relational database (it uses multidimensional offsetting to locate the data).

• It performs better on OLAP operations because the aggregates are pre-calculated and because the way the data is physically stored (compressed multidimensional array format with offset positioning) minimizes the number of IO operations (disk reads).

• Drawbacks

• The first drawback is the processing time required for loading the database and calculating the aggregate values. Whenever the relational source is updated, the MDB needs to be updated or reprocessed; in other words, the aggregate cells need to be recalculated (it doesn’t have to be done in real time).

• The second drawback is the scalability: an MDB may not scale well for a very large database (multiple terabytes) or a large number of dimensions.
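The idea of multidimensional offsetting can be illustrated with a dense array stored in row-major order: the position of a cell is computed directly from its dimension indexes, so no index lookup is needed. This is only a sketch of the principle; the function name and the cube sizes are assumptions for the example, and real MOLAP engines add compression and their own storage layouts.

```python
def cell_offset(indexes, sizes):
    """Row-major offset of a cell in a dense multidimensional array."""
    offset = 0
    for index, size in zip(indexes, sizes):
        if not 0 <= index < size:
            raise IndexError("index out of range for its dimension")
        offset = offset * size + index
    return offset


# Hypothetical cube with 2 products, 3 stores and 2 days.
sizes = (2, 3, 2)

# Cell (product 1, store 2, day 0): (1 * 3 + 2) * 2 + 0 = 10
print(cell_offset((1, 2, 0), sizes))  # 10
```

Because the offset is pure arithmetic, a cell can be located with a single read once its coordinates are known.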

Page 13: Multidimensional Data Model

MOLAP Cube

Fact Table (RDBMS):

sale
prodId  storeId  amt
p1      s1       12
p2      s1       11
p1      s3       50
p2      s2       8

Multi-dimensional cube (dimensions = 2):

      s1   s2   s3
p1    12        50
p2    11   8
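The step from the fact table to the two-dimensional cube can be sketched in a few lines of Python. The rows are taken from the slide; using a nested list with None for empty cells is just one possible representation, chosen here for readability.

```python
fact_rows = [("p1", "s1", 12), ("p2", "s1", 11), ("p1", "s3", 50), ("p2", "s2", 8)]

products = ["p1", "p2"]        # row axis of the cube
stores = ["s1", "s2", "s3"]    # column axis of the cube

# Start with an empty cube, then place each fact in its cell.
cube = [[None] * len(stores) for _ in products]
for prod, store, amt in fact_rows:
    cube[products.index(prod)][stores.index(store)] = amt

print(cube)  # [[12, None, 50], [11, 8, None]]
```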

Page 14: Multidimensional Data Model

3-D MOLAP Cube

Fact Table (RDBMS):

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

Multi-dimensional cube (dimensions = 3):

day 1
      s1   s2   s3
p1    12        50
p2    11   8

day 2
      s1   s2   s3
p1    44   4
p2
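A three-dimensional cube is usually sparse: most combinations of product, store and day have no sale. The sketch below stores the slide's six facts in a dictionary keyed by cell coordinates and compares the number of filled cells with the size of the full cube. The dictionary representation is an assumption made for illustration, not how a particular MOLAP product stores its data.

```python
fact_rows = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]
products, stores, days = ["p1", "p2"], ["s1", "s2", "s3"], [1, 2]

# Sparse 3-D cube: only non-empty cells are stored, keyed by (product, store, day).
cube = {(prod, store, day): amt for prod, store, day, amt in fact_rows}

total_cells = len(products) * len(stores) * len(days)

print(cube[("p1", "s3", 1)])       # 50
print(cube.get(("p2", "s3", 2)))   # None (empty cell)
print(len(cube), total_cells)      # 6 12
```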

Page 15: Multidimensional Data Model

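3-D MOLAP Cube with Hierarchy

dimensions = 3

- Product Dimension

- Location Dimension

- Time Dimension

Detailed cube (Location: Region 1 > City 1, City 2 > stores s1, s2, s3; Time: Week 1 > Day 1, Day 2):

Day 1
      s1   s2   s3
p1    12        50
p2    11   8

Day 2
      s1   s2   s3
p1    44   4
p2

Aggregated by location (Location: Region 1; Time: Week 1 > Day 1, Day 2):

Day 1
      Region 1
p1    62
p2    19

Day 2
      Region 1
p1    48
p2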

Page 16: Multidimensional Data Model

Concept Hierarchies

• A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.

• For example, suppose the dimension location is described by the attributes number, street, city, province or state, zipcode, and country. Four of these attributes form a concept hierarchy with 4 levels: “street < city < province or state < country”.

• The attributes of a dimension may also be organized in a partial order, forming a lattice. For example, the time dimension has the partial order “day < {month < quarter; week} < year”.

Industry    Region     Year
Category    Country    Quarter
Product     City       Month, Week
            Office     Day
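A concept hierarchy can be represented as a mapping from each member to its parent at the next level, and a roll-up then sums values under their ancestors. The sketch below assumes a three-level location hierarchy (city < province < country); the member names and sales figures are hypothetical and chosen only to make the example concrete.

```python
from collections import defaultdict

LEVELS = ["city", "province", "country"]

# Hypothetical members: each entry maps a member to its parent one level up.
PARENT = {
    "Hat Yai": "Songkhla", "Sadao": "Songkhla", "Phuket Town": "Phuket",
    "Songkhla": "Thailand", "Phuket": "Thailand",
}


def ancestor(member, from_level, to_level):
    """Climb the hierarchy from from_level up to to_level."""
    steps = LEVELS.index(to_level) - LEVELS.index(from_level)
    if steps < 0:
        raise ValueError("to_level must not be below from_level")
    for _ in range(steps):
        member = PARENT[member]
    return member


def roll_up(sales_by_city, to_level):
    """Aggregate city-level sales to a higher level of the hierarchy."""
    totals = defaultdict(int)
    for city, amt in sales_by_city.items():
        totals[ancestor(city, "city", to_level)] += amt
    return dict(totals)


sales_by_city = {"Hat Yai": 12, "Sadao": 8, "Phuket Town": 50}
print(roll_up(sales_by_city, "province"))  # {'Songkhla': 20, 'Phuket': 50}
print(roll_up(sales_by_city, "country"))   # {'Thailand': 70}
```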

Page 17: Multidimensional Data Model

Lattice of Cuboids

Dimensions: item, city, year; measure: sales_in_Euro

0-D (apex) cuboid: ()

1-D cuboids: (city), (item), (year)

2-D cuboids: (city, item), (city, year), (item, year)

3-D (base) cuboid: (city, item, year)
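The lattice can be generated mechanically: every subset of the dimensions is one cuboid, so three dimensions give 2^3 = 8 cuboids. A minimal sketch using itertools from the Python standard library:

```python
from itertools import combinations

dimensions = ("city", "item", "year")

# One cuboid per subset of the dimensions, from the apex () to the base cuboid.
cuboids = [
    combo
    for k in range(len(dimensions) + 1)
    for combo in combinations(dimensions, k)
]

for cuboid in cuboids:
    print(cuboid)
print(len(cuboids))  # 8
```

The first entry is the empty tuple (the apex cuboid, a single grand total) and the last is the base cuboid with all three dimensions.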

Page 18: Multidimensional Data Model

3D, 2L

• {(),

• (c1), (c2), (i1), (i2), (y1), (y2),

• (c1,i1), (c1,i2), (c2,i1), (c2,i2),

• (c1,y1), (c1,y2), (c2,y1), (c2,y2),

• (i1,y1), (i1,y2), (i2,y1), (i2,y2),

• (c1,i1,y1), (c1,i1,y2), (c1,i2,y1), (c1,i2,y2),

• (c2,i1,y1), (c2,i1,y2), (c2,i2,y1), (c2,i2,y2)}

$\prod_{i=1}^{3} (2 + 1) = 3^3 = 27$
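The 27 entries listed above can be enumerated by giving each dimension its two values plus one extra "all" value that stands for aggregation over that dimension. The sketch below uses "*" as that marker, which is a notation chosen for this example rather than something taken from the slide.

```python
from collections import Counter
from itertools import product

ALL = "*"  # hypothetical marker meaning "aggregated over this dimension"

cities = ["c1", "c2"]
items = ["i1", "i2"]
years = ["y1", "y2"]

# Each dimension contributes its 2 values plus ALL, giving (2 + 1)^3 cells.
cells = list(product(cities + [ALL], items + [ALL], years + [ALL]))
print(len(cells))  # 27

# Group the cells by how many dimensions are fixed to a concrete value.
by_fixed_dims = Counter(sum(value != ALL for value in cell) for cell in cells)
print(sorted(by_fixed_dims.items()))  # [(0, 1), (1, 6), (2, 12), (3, 8)]
```

The counts 1, 6, 12 and 8 match the groups in the listing: the empty cell, the single-value cells, the pairs, and the fully specified triples.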

Page 19: Multidimensional Data Model

OLAP Operations

• In a multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies.

• OLAP provides a user-friendly environment for interactive data analysis.

• A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand.

Page 20: Multidimensional Data Model

Types of OLAP Operations

• Roll up (drill-up): summarize data

• by climbing up hierarchy or by dimension reduction

• Drill down (roll down): reverse of roll-up

• from higher level summary to lower level summary or detailed data, or introducing new dimensions

• Slice and dice:

• project and select

• Pivot (rotate):

• reorient the cube, visualization, 3D to series of 2D planes.

• Other operations

• drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
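Of the operations listed above, pivot is the simplest to show in code: it changes no values and only swaps the axes of a view. The sketch below rotates a small two-dimensional view (products as rows, stores as columns); the numbers are sample values used for illustration, with None marking an empty cell.

```python
products = ["p1", "p2"]
stores = ["s1", "s2", "s3"]

# Rows are products, columns are stores.
view = [
    [12, None, 50],
    [11, 8, None],
]

# Pivot (rotate): rows become stores, columns become products.
pivoted = [list(column) for column in zip(*view)]

print(pivoted)  # [[12, 11], [None, 8], [50, None]]
```

After the pivot, the row labels are the stores and the column labels are the products; the cell for (p1, s3) is still 50.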

Page 21: Multidimensional Data Model

Drill-Down

               Food Line   Outdoor Line   CATEGORY_total
Asia           59,728      151,174        210,902

               Food Line   Outdoor Line   CATEGORY_total
Malaysia       618         9,418          10,036
China          33,198.5    74,165         107,363.5
India          6,918       0              6,918
Japan          13,871.5    34,965         48,836.5
Singapore      5,122       32,626         37,748
Belgium        7,797.5     21,125         28,922.5

Page 22: Multidimensional Data Model

Roll-Up

               Food Line   Outdoor Line   CATEGORY_total
Canada         29,116.5    69,310         98,426.5
Mexico         12,743.5    24,284         37,027.5
United States  102,561.5   232,679        335,240.5

               Food Line   Outdoor Line   CATEGORY_total
North America  144,421.5   326,273        470,694.5
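The roll-up on this slide can be checked with a short script: summing the three country rows per product line should give the North America row. The figures are taken from the slide; the dictionary layout and the function name are assumptions made for this sketch.

```python
country_sales = {
    "Canada": {"Food Line": 29116.5, "Outdoor Line": 69310.0},
    "Mexico": {"Food Line": 12743.5, "Outdoor Line": 24284.0},
    "United States": {"Food Line": 102561.5, "Outdoor Line": 232679.0},
}

# Country-to-region mapping for the three countries shown on the slide.
REGION_OF = {
    "Canada": "North America",
    "Mexico": "North America",
    "United States": "North America",
}


def roll_up_to_region(sales, region_of):
    """Climb one level of the location hierarchy: country -> region."""
    regions = {}
    for country, by_line in sales.items():
        totals = regions.setdefault(region_of[country], {})
        for line, value in by_line.items():
            totals[line] = totals.get(line, 0.0) + value
    return regions


region_sales = roll_up_to_region(country_sales, REGION_OF)
print(region_sales)
# {'North America': {'Food Line': 144421.5, 'Outdoor Line': 326273.0}}
print(sum(region_sales["North America"].values()))  # 470694.5
```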

Page 23: Multidimensional Data Model

Slice

               Food Line   Outdoor Line   CATEGORY_total
Asia           59,728      151,174        210,902
Europe         97,580.5    213,304        310,884.5
North America  144,421.5   326,273        470,694.5
REGION_total   301,730     690,751        992,481

               Food Line   Outdoor Line   CATEGORY_total
North America  144,421.5   326,273        470,694.5
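A slice fixes one dimension to a single member and returns the remaining sub-cube. The sketch below slices the region-level figures from this slide on region = North America; storing the cube as a dictionary keyed by (region, product line) is an assumption made for the example.

```python
cube = {
    ("Asia", "Food Line"): 59728.0, ("Asia", "Outdoor Line"): 151174.0,
    ("Europe", "Food Line"): 97580.5, ("Europe", "Outdoor Line"): 213304.0,
    ("North America", "Food Line"): 144421.5,
    ("North America", "Outdoor Line"): 326273.0,
}


def slice_region(cube, region):
    """Fix the region dimension; the result only varies by product line."""
    return {line: value for (r, line), value in cube.items() if r == region}


north_america = slice_region(cube, "North America")
print(north_america)                # {'Food Line': 144421.5, 'Outdoor Line': 326273.0}
print(sum(north_america.values()))  # 470694.5
```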

Page 24: Multidimensional Data Model

Dice

               Food Line   Outdoor Line   CATEGORY_total
Canada         29,116.5    69,310         98,426.5
Mexico         12,743.5    24,284         37,027.5
United States  102,561.5   232,679        335,240.5

               Food Line   Outdoor Line
Mexico         12,743.5    24,284
United States  102,561.5   232,679
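A dice selects a sub-cube by restricting two or more dimensions at once. The sketch below keeps the countries Mexico and United States and the two product lines, using the country-level figures from this slide; the dictionary representation and the function name are assumptions for illustration.

```python
cube = {
    ("Canada", "Food Line"): 29116.5, ("Canada", "Outdoor Line"): 69310.0,
    ("Mexico", "Food Line"): 12743.5, ("Mexico", "Outdoor Line"): 24284.0,
    ("United States", "Food Line"): 102561.5,
    ("United States", "Outdoor Line"): 232679.0,
}


def dice(cube, countries, lines):
    """Keep only the cells whose country and product line are both selected."""
    return {
        (country, line): value
        for (country, line), value in cube.items()
        if country in countries and line in lines
    }


sub_cube = dice(cube, {"Mexico", "United States"}, {"Food Line", "Outdoor Line"})
print(len(sub_cube))                     # 4
print(sub_cube[("Mexico", "Food Line")]) # 12743.5
```

Unlike a slice, the result still has both dimensions; it is simply a smaller cube.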

Page 25: Multidimensional Data Model

MOLAP in Pentaho

Page 26: Multidimensional Data Model

Generate Reports

Page 27: Multidimensional Data Model

Exercise 1

Country

- Country_ID

- Country_Name

State

- State_ID

- State_Name

- Country_ID

City

- City_ID

- City_Name

- State_ID

Sale

- Time_ID

- Product_ID

- Store_ID

- Sale

- Order

Store

- Store_ID

- Store_Name

- City_ID

Time

- Time_ID

- DayOfWeek

- Month

- Quarter

- Year

Product

- Product_ID

- Product_Name

- ProductType_ID

Product_Type

- ProductType_ID

- ProductType_Name

= 1:m

Sale = ( )

Order = ( )

Page 28: Multidimensional Data Model

Exercise 2

Time:Year:                           2003                         2004
Product:Type:                        Sport cars    Classic cars   Sport cars    Classic cars
Location:Country:  Location:State:   Sale  Order   Sale  Order    Sale  Order   Sale  Order

Australia NSW 5 2 10 5 2 2 1 5

Queensland 1 5 1 1 2 8 10 5

Victoria 8 2 1 1 10 5

USA CA 10 5 2 1 12 6 5 2

CT 4 1 5 8 1 1 15 10

NY 20 10 15 2 25 10 20 5

NJ 10 5 8 2 15 10