Data Models for Warehouse Session-12/13 Data Management for Decision Support

Upload: everett-sims

Post on 13-Jan-2016

TRANSCRIPT

Page 1: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Data Models for Warehouse

Session-12/13

Data Management for Decision Support

Page 2: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Data Models

Data Models: relations, stars & snowflakes, cubes

Operators: slice & dice, roll-up, drill down, pivoting, other

Page 3: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Data Models

Star schemas are database schemas that exploit the structure of the data used in decision support queries. Queries in a DSS tend to:

Examine a set of factual transactions (POS, customer events)

Analyze the facts in a variety of ways (POS transactions by week, or by store)

For example, in a retail store the POS transaction is at the center, surrounded by:

Product information: SKU, hierarchy (section, dept, BU)

Time information: day, week, month, year

Stores: Store-id, hierarchy (region, city, locality)

Suppliers: Sup-id, location, discounts

Page 4: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Data Models

[Diagram: Sales Transactions at the center, linked to Products, Time, Suppliers, and Stores]

Information is split between two classes: factual information and reference information

Page 5: Data Models for Warehouse Session-12/13 Data Management for Decision Support

FACT DATA

Fact data records information on factual events that occurred in the business: POS, phone calls, banking transactions

Typically 70% of Warehouse data is Fact data

It is important to identify and define the structure right in the first place, as restructuring is an expensive process

The detailed content of a FACT is derived from the business requirement

Recorded Facts do not change, as they are events of the past

Page 6: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Dimension Data

Information that is used for analyzing the elemental data, for example, product hierarchy, time periods, customers, stores

It is the reference data used for analysis of Facts

Organizing the information in separate reference tables offers better query performance

It differs from Fact data as it changes over time, due to changes in business, reorganization

It should be structured to permit rapid changes

Page 7: Data Models for Warehouse Session-12/13 Data Management for Decision Support

FACT and Dimensions

Fact table: millions to billions of rows; multiple foreign keys; numeric; does not change

Dimension table: tens to millions of rows; one primary key; textual description; frequently modified

Page 8: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Decision Support Queries

Examples

Average number of sales of Haldiram per store over last month (various types within the brand)

Projected sales of Deepavali gift packs against the actual

The top 20% customers (spending) over last quarter

The customers with average balance in excess of Rs. 25000 for past one year

==> Each of these queries is based on Factual data

Page 9: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Decision Support Queries

Examples

Sales of Haldiram: POS Transaction (Quantity Sold, Product, Store, Date, Time, Revenue Realized)

Customer Spend: Membership card Transaction (Customer-Id, Store, Transaction Value, Date and Time)

Account Balance: Account transactions (Customer, AC number, type of transaction, amount)

Page 10: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Star Schema

The star schema is a data-modeling technique used to map multidimensional decision support into a relational database.

Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.

Four Components: Facts, Dimensions, Attributes, Attribute hierarchies

Page 11: Data Models for Warehouse Session-12/13 Data Management for Decision Support

A Simple Star Schema

Page 12: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Star Schema

Facts

Facts are numeric measurements (values) that represent a specific business aspect or activity.

The fact table contains facts that are linked through their dimensions.

Facts can be computed or derived at run-time (metrics).

Dimensions

Dimensions are qualifying characteristics that provide additional perspectives to a given fact.

Dimensions are stored in dimension tables.

Page 13: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Identifying Facts and Dimensions

Elemental Transaction

Determine Key Dimensions

Check if a Fact is a dimension

Check if a dimension is a Fact

Page 14: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Identification: Step 1

Examine the enterprise model and identify the transactions that are of interest, driven by business requirement analysis

These will be transactions that describe events fundamental to the business, e.g., calls for Telecom, account transactions in banking

For each potential Fact, ask: is this information operated upon by a business process? Consider daily sales versus POS: even if the system reports daily sales, POS may be the FACT

The limits of current recording should not influence Warehouse design

Page 15: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Identification: Step 1

Sector, Business: Fact Table

Retail, Sales: POS Transaction

Retail, Shrinkage: Stock movement and position

Retail Banking, Customer profiling: Customer events

Retail Banking, Profitability: Account transactions

Insurance, Product Profitability: Claims and receipts

Telecom, Call Analysis: Call events

Telecom, Customer Analysis: Customer events (install, disconnect, payment)

Page 16: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Identification: Step 2

Look at the logical model to find the entities associated with entities in the fact table. List all such logically associated entities.

These are candidate references; the task is to find the key dimension entities, which may not be directly associated.

For example, in retail banking the account transactions are a candidate fact table, and the account is a candidate reference. The customer is only indirectly related to the transaction, but is a better choice.

Analyze account transactions by account? Or analyze how customers use our services?

You store both relationships, but customer becomes a dimension

Page 17: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Identification: Step 3

Check that a candidate FACT is not actually a denormalized dimension table. Consider the following:

house-details: cable laid, salesperson's visit, connected to the service, promotional material sent, subscription cancelled, …

Home-details: candidate fact

Operational events

Report on number of connections quarter-to-date

Time lag between laying and subscription

Page 18: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Identification: Step 4

Check that a Dimension is not a FACT

A lot depends on the DSS requirements:

Customer can be a FACT or a Dimension

Promotions can be a FACT or a Dimension

Ask questions using the other dimensions: by how many other dimensions can I view this entity?

Can I view promotions by time? Can I view promotions by product? Can I view promotions by store? Can I view promotions by supplier?

If the answer to these questions is yes, then it is a FACT

Page 19: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Star Schema

Attributes

Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts.

Dimensions provide descriptive characteristics about the facts through their attributes.

Possible Attributes For Sales Dimensions

Page 20: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Three Dimensional View Of Sales

Page 21: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Slice And Dice View Of Sales

Page 22: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Star Schema

Attribute Hierarchies

Attributes within dimensions can be ordered in a well-defined attribute hierarchy.

The attribute hierarchy provides a top-down data organization that is used for two main purposes:

Aggregation

Drill-down/roll-up data analysis

Page 23: Data Models for Warehouse Session-12/13 Data Management for Decision Support

A Location Attribute Hierarchy

Page 24: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Attribute Hierarchies In Multidimensional Analysis

Page 25: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Star Schema

Star Schema Representation

Facts and dimensions are normally represented by physical tables in the data warehouse database.

The fact table is related to each dimension table in a many-to-one (M:1) relationship.

Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.
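
The representation described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table names, column names, and sample rows (store_dim, product_dim, sales_fact, the cities) are hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical tables and data, chosen only to illustrate the M:1 relationship.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city  TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE sales_fact (
    store_key   INTEGER NOT NULL REFERENCES store_dim(store_key),
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    units       INTEGER NOT NULL
);
INSERT INTO store_dim   VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO product_dim VALUES (10, 'Haldiram');
-- Many fact rows point at one dimension row (M:1).
INSERT INTO sales_fact VALUES (1, 10, 5), (1, 10, 7), (2, 10, 3);
""")

# Join the fact table to a dimension and aggregate by a dimension attribute.
units_by_city = dict(conn.execute("""
    SELECT s.city, SUM(f.units)
    FROM sales_fact f JOIN store_dim s ON s.store_key = f.store_key
    GROUP BY s.city
"""))

# A fact row that references a missing dimension key violates the constraint.
try:
    conn.execute("INSERT INTO sales_fact VALUES (99, 10, 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

Here units_by_city groups the facts by an attribute that lives only in the dimension table, and fk_rejected records whether the foreign key constraint blocked the orphan fact row.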

Page 26: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Star Schema For Sales

Page 27: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Orders Star Schema

Page 28: Data Models for Warehouse Session-12/13 Data Management for Decision Support

The Multi-Dimensional Model

“Sales by product line over the past six months”

“Sales by store between 1990 and 1995”

Fact table for measures: Prod Code, Time Code, Store Code (key columns joining the fact table to the dimension tables); Sales Qty (numerical measure)

Dimension tables: Store Info, Product Info, Time Info, . . .

Page 29: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Dimensional Modeling

Dimensions are organized into hierarchies

E.g., Time dimension: days -> weeks -> quarters

E.g., Product dimension: product -> product line -> brand

Dimensions have attributes

Page 30: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Dimension Hierarchies

Store Dimension: Stores -> District -> Region -> Total

Product Dimension: Products -> Brand -> Manufacturer -> Total

Page 31: Data Models for Warehouse Session-12/13 Data Management for Decision Support

ROLAP: Dimensional Modeling Using Relational DBMS

Special schema design: star, snowflake

Special indexes: bitmap, multi-table join

Special tuning: maximize query throughput

Proven technology (relational model, DBMS), tends to outperform specialized MDDB, especially on large data sets

Products: IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

Page 32: Data Models for Warehouse Session-12/13 Data Management for Decision Support

MOLAP: Dimensional Modeling Using the Multi Dimensional Model

MDDB: a special-purpose data model

Facts stored in multi-dimensional arrays

Dimensions used to index array

Sometimes on top of relational DB

Products: Pilot, Arbor Essbase, Gentia

Page 33: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Star Schema (in RDBMS)

Page 34: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Star Schema Example

Page 35: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Star Schema with Sample Data

Page 36: Data Models for Warehouse Session-12/13 Data Management for Decision Support

The “Classic” Star Schema

A single fact table, with detail and summary data

Fact table primary key has only one key column per dimension

Each key is generated

Each dimension is a single table, highly denormalized

Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata

Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem

Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr., Level

Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Resolution, Sequence

Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer, Level

Page 37: Data Models for Warehouse Session-12/13 Data Management for Decision Support

The “Classic” Star Schema

The biggest drawback: dimension tables must carry a level indicator for every record and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district will be pulled from the fact table, resulting in error.

Example:

Select A.STORE_KEY, A.PERIOD_KEY, A.dollars
from Fact_Table A
where A.STORE_KEY in (select STORE_KEY from Store_Dimension B where region = 'North' and Level = 2)
and etc...

Level is needed whenever aggregates are stored with detail facts.
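
The effect of dropping the level constraint can be reproduced on a toy data set. The sketch below uses Python's sqlite3 module with invented rows, and it assumes that level 1 marks an individual store and level 2 a region aggregate; the slides do not say how levels are numbered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed encoding: level 1 = individual store, level 2 = region aggregate.
CREATE TABLE store_dimension (store_key INTEGER PRIMARY KEY, region TEXT, level INTEGER);
CREATE TABLE fact_table (store_key INTEGER, period_key INTEGER, dollars INTEGER);
INSERT INTO store_dimension VALUES (1, 'North', 1), (2, 'North', 1), (100, 'North', 2);
-- Two detail rows plus the stored region aggregate (100 + 200 = 300).
INSERT INTO fact_table VALUES (1, 1, 100), (2, 1, 200), (100, 1, 300);
""")

QUERY = """
SELECT SUM(a.dollars) FROM fact_table a
WHERE a.store_key IN (SELECT store_key FROM store_dimension
                      WHERE region = 'North' {level_filter})
"""
# With the level constraint only the aggregate row is read.
with_level = conn.execute(QUERY.format(level_filter="AND level = 2")).fetchone()[0]
# Without it, detail rows and the aggregate row are summed together.
without_level = conn.execute(QUERY.format(level_filter="")).fetchone()[0]
```

In this toy data the unconstrained query returns double the true regional total, a plausible-looking number with nothing to flag it as wrong.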

Page 38: Data Models for Warehouse Session-12/13 Data Management for Decision Support

The “Level” Problem

Level is a problem because it creates potential for error. If the query builder, human or program, forgets about it, perfectly reasonable-looking WRONG answers can occur.

One alternative: the FACT CONSTELLATION model...

Page 39: Data Models for Warehouse Session-12/13 Data Management for Decision Support

The “Fact Constellation” Schema

District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.

Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Sequence

Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer

Page 40: Data Models for Warehouse Session-12/13 Data Management for Decision Support

The “Fact Constellation” Schema

In the Fact Constellation, aggregate tables are created separately from the detail; therefore it is impossible to pick up, for example, Store detail when querying the District Fact Table.

Major Advantage: No need for the “Level” indicator in the dimension tables, since no aggregated data is stored with lower-level detail

Disadvantage: Dimension tables are still very large in some cases, which can slow performance; front-end must be able to detect existence of aggregate facts, which requires more extensive metadata
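
As a rough sketch of the idea, a district-level aggregate can be materialized in its own table from the detail facts. The example uses Python's sqlite3 module; the column subset and sample rows are invented for illustration, and only the table names follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dimension (store_key INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE fact_table (store_key INTEGER, product_key INTEGER,
                         period_key INTEGER, dollars INTEGER);
INSERT INTO store_dimension VALUES (1, 10), (2, 10), (3, 20);
INSERT INTO fact_table VALUES (1, 7, 1, 100), (2, 7, 1, 200), (3, 7, 1, 50);

-- Aggregates live in their own table, so no Level column is needed.
CREATE TABLE district_fact_table AS
SELECT s.district_id, f.product_key, f.period_key, SUM(f.dollars) AS dollars
FROM fact_table f JOIN store_dimension s ON s.store_key = f.store_key
GROUP BY s.district_id, f.product_key, f.period_key;
""")

# Querying the aggregate table can never return store-level rows.
district_dollars = dict(conn.execute(
    "SELECT district_id, dollars FROM district_fact_table"))
```

The front-end still has to know that district_fact_table exists and when to use it, which is the metadata burden the slide mentions.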

Page 41: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Another Alternative to “Level”

Fact Constellation is a good alternative to the Star, but when dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay.

An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table, the “Snowflake Schema” ...

Page 42: Data Models for Warehouse Session-12/13 Data Management for Decision Support

The “Snowflake” Schema

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.

District attribute table: District_ID, District Desc., Region_ID

Region attribute table: Region_ID, Region Desc., Regional Mgr.

Store Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price

District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

Page 43: Data Models for Warehouse Session-12/13 Data Management for Decision Support

The “Snowflake” Schema

No LEVEL in dimension tables

Dimension tables are normalized by decomposing at the attribute level

Each dimension table has one key for each level of the dimension's hierarchy

The lowest level key joins the dimension table to both the fact table and the lower level attribute table

How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.

Page 44: Data Models for Warehouse Session-12/13 Data Management for Decision Support

The “Snowflake” Schema

Additional features: The original Store Dimension table, completely de-normalized, is kept intact, since certain queries can benefit by its all-encompassing content.

In practice, start with a Star Schema and create the “snowflakes” with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table.

Advantage: Best performance when queries involve aggregation

Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database
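
Creating the snowflakes with queries might look roughly like the sketch below, which derives the district and region attribute tables from a denormalized store dimension using Python's sqlite3 module. The sample rows and the derived table names (district, region) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dimension (
    store_key INTEGER PRIMARY KEY, district_id INTEGER, district_desc TEXT,
    region_id INTEGER, region_desc TEXT);
INSERT INTO store_dimension VALUES
    (1, 10, 'D-10', 100, 'North'),
    (2, 10, 'D-10', 100, 'North'),
    (3, 20, 'D-20', 200, 'South');

-- Snowflaked attribute tables derived from the star dimension by query.
CREATE TABLE district AS
    SELECT DISTINCT district_id, district_desc, region_id FROM store_dimension;
CREATE TABLE region AS
    SELECT DISTINCT region_id, region_desc FROM store_dimension;
""")

district_rows = conn.execute(
    "SELECT district_id, district_desc, region_id FROM district ORDER BY district_id"
).fetchall()
region_count = conn.execute("SELECT COUNT(*) FROM region").fetchone()[0]
```

The original store_dimension table is left untouched, matching the note that the fully denormalized dimension is kept intact.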

Page 45: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Advantages of ROLAP Dimensional Modeling

Define complex, multi-dimensional data with simple model

Reduces the number of joins a query has to process

Allows the data warehouse to evolve with relatively low maintenance

HOWEVER! Star schema and relational DBMS are not the magic solution

Query optimization is still problematic

Page 46: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Aggregates

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

Add up amounts for day 1

In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

Result: 81

Page 47: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Aggregates

Add up amounts by day

In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

ans
date  sum
1     81
2     48

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4
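
The two aggregate queries on the sale table can be run as written with Python's sqlite3 module. The sketch loads the six rows from the slide; the in-memory database and the added ORDER BY (used only to make the row order deterministic) are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Add up amounts for day 1.
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date").fetchall()
```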

Page 48: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Another Example

Add up amounts by day, product

In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

Result:

prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

drill-down: from the totals by day to this result

rollup: from this result back to the totals by day

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4
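
The drill-down and rollup relationship between the two result tables can be checked with a short sketch. It uses Python's sqlite3 module for the grouped query and then rolls the result up in plain Python by dropping the product column; the variable names are illustrative.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Drill-down level: totals by day and product.
by_day_product = conn.execute("""
    SELECT prodId, date, sum(amt) FROM sale
    GROUP BY date, prodId ORDER BY date, prodId
""").fetchall()

# Rollup: drop the product column and add the amounts per day.
rolled_up = defaultdict(int)
for _prod, day, amt in by_day_product:
    rolled_up[day] += amt
```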

Page 49: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Aggregates

Operators: sum, count, max, min, median, ave

“Having” clause

Using dimension hierarchy:

average by region (within store)

maximum by month (within date)
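
A small sketch of both ideas, using Python's sqlite3 module on the sale rows from the earlier slides. The store table is a hypothetical dimension added for this example; its store-to-region mapping (s1 in region A; s2 and s3 in region B) follows the hierarchy example later in the deck. The threshold of 20 is arbitrary, and max is used rather than ave to keep the results integral.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
INSERT INTO sale VALUES
    ('p1','s1',1,12), ('p2','s1',1,11), ('p1','s3',1,50),
    ('p2','s2',1,8),  ('p1','s1',2,44), ('p1','s2',2,4);
-- Hypothetical store dimension carrying the store -> region hierarchy.
CREATE TABLE store (storeId TEXT PRIMARY KEY, region TEXT);
INSERT INTO store VALUES ('s1','A'), ('s2','B'), ('s3','B');
""")

# HAVING filters groups after aggregation: keep stores whose total exceeds 20.
big_stores = conn.execute("""
    SELECT storeId, sum(amt) FROM sale
    GROUP BY storeId HAVING sum(amt) > 20 ORDER BY storeId
""").fetchall()

# Aggregate along the hierarchy: maximum sale amount per region.
max_by_region = conn.execute("""
    SELECT st.region, max(s.amt)
    FROM sale s JOIN store st ON st.storeId = s.storeId
    GROUP BY st.region ORDER BY st.region
""").fetchall()
```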

Page 50: Data Models for Warehouse Session-12/13 Data Management for Decision Support

ROLAP vs. MOLAP

ROLAP: Relational On-Line Analytical Processing

MOLAP: Multi-Dimensional On-Line Analytical Processing

Page 51: Data Models for Warehouse Session-12/13 Data Management for Decision Support

The MOLAP Cube

Fact table view:

sale
prodId  storeId  amt
p1      s1       12
p2      s1       11
p1      s3       50
p2      s2       8

Multi-dimensional cube:

      s1   s2   s3
p1    12        50
p2    11    8

dimensions = 2
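
The step from the fact table view to the cube can be sketched in plain Python. The nested-list layout and the use of None for empty cells are assumptions of this example, not a description of how any MOLAP product stores its arrays.

```python
# Fact table rows from the slide: (prodId, storeId, amt).
rows = [("p1", "s1", 12), ("p2", "s1", 11), ("p1", "s3", 50), ("p2", "s2", 8)]

# The dimension values index the array positions.
products = sorted({prod for prod, _store, _amt in rows})
stores = sorted({store for _prod, store, _amt in rows})

# 2-D array: one row per product, one column per store; None marks an empty cell.
cube = [[None] * len(stores) for _ in products]
for prod, store, amt in rows:
    cube[products.index(prod)][stores.index(store)] = amt
```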

Page 52: Data Models for Warehouse Session-12/13 Data Management for Decision Support

3-D Cube

Fact table view:

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

Multi-dimensional cube:

day 1   s1   s2   s3
p1      12        50
p2      11    8

day 2   s1   s2   s3
p1      44    4
p2

dimensions = 3

Page 53: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Example

[Figure: data cube with axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread), and Time (M, T, W, Th, F, S, S); visible cell values 10, 34, 56, 32, 12, 56]

56 units of bread sold in LA on M

Dimensions: Time, Product, Store

Attributes: Product (upc, price, …), Store …, …

Hierarchies:

Product -> Brand -> … (roll-up to brand)

Day -> Week -> Quarter (roll-up to week)

Store -> Region -> Country (roll-up to region)

Page 54: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Cube Aggregation: Roll-up

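Example: computing sums

Cube:

day 1   s1   s2   s3
p1      12        50
p2      11    8

day 2   s1   s2   s3
p1      44    4
p2

Rolled up over days:

        s1   s2   s3
p1      56    4   50
p2      11    8

Rolled up over products:

        s1   s2   s3
sum     67   12   50

Rolled up over stores:

        sum
p1      110
p2      19

Grand total: 129

rollup moves toward the totals; drill-down moves back toward the detail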
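
The sums in this roll-up example can be recomputed with a few lines of plain Python. The roll_up helper below is a hypothetical function written for this sketch: it keeps the listed key columns and sums amt over everything else.

```python
from collections import defaultdict

# Fact rows from the slides: (prodId, storeId, date, amt).
sale = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]

def roll_up(rows, keep):
    """Sum amt (column 3) grouped by the column positions listed in keep."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in keep)] += row[3]
    return dict(totals)

by_product_store = roll_up(sale, (0, 1))  # rolled up over days
by_store = roll_up(sale, (1,))            # rolled up over days and products
by_product = roll_up(sale, (0,))          # rolled up over days and stores
grand_total = roll_up(sale, ())           # rolled up over everything
```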

Page 55: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Cube Operators for Roll-up

[Figure: the day 1 / day 2 cube and its roll-up tables, annotated with cube operators]

sale(s1,*,*) = 67

sale(*,*,*) = 129

sale(s2,p2,*) = 8

Page 56: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Extended Cube

day 1   s1   s2   s3   *
p1      12        50   62
p2      11    8        19
*       23    8   50   81

day 2   s1   s2   s3   *
p1      44    4        48
p2
*       44    4        48

*       s1   s2   s3   *
p1      56    4   50   110
p2      11    8        19
*       67   12   50   129

sale(*,p2,*) = 19
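
One way to picture the extended cube is as a lookup table in which every coordinate may also be the wildcard "*". The plain Python sketch below builds such a table from the six sale rows. The argument order (store, product, date) follows the slides' sale(s1,*,*) notation; the dictionary representation is an assumption of this example.

```python
from collections import defaultdict
from itertools import product as cartesian

# Fact rows from the slides: (prodId, storeId, date, amt).
sale = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]

# Each fact contributes to every cell reachable by replacing any
# subset of its coordinates with "*".
extended = defaultdict(int)
for prod, store, day, amt in sale:
    for key in cartesian((store, "*"), (prod, "*"), (day, "*")):
        extended[key] += amt

total_p2 = extended[("*", "p2", "*")]    # sale(*,p2,*)
total_s1 = extended[("s1", "*", "*")]    # sale(s1,*,*)
total_all = extended[("*", "*", "*")]    # sale(*,*,*)
total_day1 = extended[("*", "*", 1)]     # bottom-right cell of the day 1 table
```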

Page 57: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Aggregation Using Hierarchies

Hierarchy: store -> region -> country

(store s1 in Region A; stores s2, s3 in Region B)

Cube:

day 1   s1   s2   s3
p1      12        50
p2      11    8

day 2   s1   s2   s3
p1      44    4
p2

Aggregated by region:

      region A   region B
p1    56         54
p2    11         8
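
The region totals can be reproduced by mapping each store to its region before summing. This is a plain Python sketch using the store-to-region assignment stated on the slide; the region_of dictionary is simply one way to encode that hierarchy level.

```python
from collections import defaultdict

# Fact rows from the slides: (prodId, storeId, date, amt).
sale = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]

# One level of the store hierarchy: store -> region.
region_of = {"s1": "A", "s2": "B", "s3": "B"}

# Replace the store coordinate by its region, then sum over days.
by_product_region = defaultdict(int)
for prod, store, _day, amt in sale:
    by_product_region[(prod, region_of[store])] += amt
```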

Page 58: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Slicing

Cube:

day 1   s1   s2   s3
p1      12        50
p2      11    8

day 2   s1   s2   s3
p1      44    4
p2

Slice, TIME = day 1:

      s1   s2   s3
p1    12        50
p2    11    8
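
In code, a slice is a filter on one dimension that leaves the others free. A minimal plain Python sketch on the same six rows:

```python
# Fact rows from the slides: (prodId, storeId, date, amt).
sale = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]

# Fix TIME = day 1; what remains is a 2-D sub-cube over product and store.
day1_slice = {(prod, store): amt for prod, store, day, amt in sale if day == 1}
```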

Page 59: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Slicing & Pivoting

Sales ($ millions) by store and product, Time = d1 (values for d2 not shown):

Store s1: Electronics $5.2, Toys $1.9, Clothing $2.3, Cosmetics $1.1

Store s2: Electronics $8.9, Toys $0.75, Clothing $4.6, Cosmetics $1.5

Pivoted view, Sales ($ millions), Time = d1:

Products      Store s1   Store s2
Electronics   $5.2       $8.9
Toys          $1.9       $0.75
Clothing      $2.3       $4.6
Cosmetics     $1.1       $1.5

Page 60: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Summary of Operations

Aggregation (roll-up): aggregate (summarize) data to the next higher dimension element, e.g., total sales by city, year -> total sales by region, year

Navigation to detailed data (drill-down)

Selection (slice) defines a subcube, e.g., sales where city = 'Gainesville' and date = '1/15/90'

Calculation and ranking, e.g., top 3% of cities by average income

Visualization operations (e.g., Pivot)

Time functions, e.g., time average

Page 61: Data Models for Warehouse Session-12/13 Data Management for Decision Support

Query & Analysis Tools

Query Building

Report Writers (comparisons, growth, graphs, …)

Spreadsheet Systems

Web Interfaces

Data Mining