
Page 1: Successful Dimensional Modeling of Very Large Data Warehouses By Bert Scalzo, Ph.D. Bert.Scalzo@Quest.com

Successful Dimensional Modeling of Very Large Data Warehouses

By Bert Scalzo, Ph.D.
Bert.Scalzo@Quest.com

Page 2:

Learning Objectives

Application Nature versus Data Modeling Approach
Important DW/DM Concepts for “Star Schema” Design
Transforming a simple data model into a “Star Schema”
Why Hierarchies are better than Snowflakes
Common Aggregation/Summarization Themes
Recommendations for Implementing Facts
Recommendations for Indexes and Keys
Oracle Issues (not a modeling topic, but always asked for)
– Partitioning Options
– Indexing Options
– Tuning Star Queries
– Materialized Views

Page 3:

Speaker’s Qualifications

Oracle Solutions Product Architect for Quest Software
Chief architect for Quest’s popular “TOAD” product
Oracle DBA for 20+ years, versions 4 through 10g
Worked for Oracle Education & Consulting
Holds several Oracle Masters (DBA & CASE)
BS, MS, PhD in Computer Science and also an MBA
LOMA insurance industry designations: FLMI and ACS
Books
– The TOAD Handbook (Feb 2003)
– Oracle DBA Guide to Data Warehousing and Star Schemas (Mar 2003)
– The TOAD Pocket Reference 2nd edition (June 2005)
Articles
– Oracle Magazine
– Oracle Technology Network (OTN)
– Oracle Informant
– PC Week (now E-Magazine)
– Linux Journal
– www.Linux.com

Page 5:

About Quest Software

Quest Software (NASDAQ: QSFT)
Founded: 1987
More than 2000 employees in 40 offices: North America, South America, Europe, Asia, Australia
Application management leader: 75% of Fortune 500
Develop, deploy, manage and maintain enterprise applications without downtime or business interruption
Best known in the Oracle community for TOAD, Spotlight, Quest Central, Shareplex, etc.

Page 6:

Why do we model?

Would you build an office without a blueprint?

The Architect will create the first high-level drawings to validate the concept with the client and then make a more detailed plan (i.e. the blueprint) for the Contractor …

The Contractor will take this blueprint and optimise it based on technical constraints. The Contractor will then create the actual office.

Page 7:

Where in Development Lifecycle

[Figure: development lifecycle diagram with the labels Analysis, Design, Conceptual, Physical, Develop, Deploy, Monitor & Maintain, and Reengineer]

Some shops just treat this as one big “Design” task

It is not uncommon for a Star Schema data model to concentrate more on physical design characteristics

Page 8:

World of Modeling …

Conceptual Data Modeling (CDM – E/R)
• Identify all data & relationships
  - E/R (Entity/Relationship) diagrams
  - DB independent view
• Business Rules?
Roles: Bus. Analyst, Data Architect, Data Analyst

Physical Data Modeling (PDM)
• DB-specific model
• Reverse engineer existing DB
• Create/Update DB from model
• Data Warehouse Modeling
Roles: DBA, DB Developer, DB Architect

Business Process Modeling (BPM)
• Improve process efficiency
• Define/document Bus. Processes
  - create correct and complete application requirements
Roles: End-user, IT Partner/Liaison, Business Analyst

Object-Oriented Modeling (OOM - UML)
• Support for all UML diagrams
  - Analyze requirements
  - Design application
• Reverse/forward engineer code
Roles: System Architect, System Analyst, App Developer

Quest’s “QDesigner” synchronizes models from all levels in a single tool

Page 9:

Know Your Application …

What type of application are you building:

On Line Transaction Processing (OLTP)

Operational Data Store (ODS)

On Line Analytical Processing (OLAP)

Data Mart / Data Warehouse (DM/DW)

Page 10:

Warehouse Architecture

[Figure: warehouse architecture diagram. Components shown: OLTP App #1 (VSAM), OLTP App #2 (Sybase), OLTP App #3 (Oracle), OLTP App #4 (ISAM), ODS (Oracle), ETL, Staging Area, Enterprise DW, DM 1, DM 2]

Page 11:

Application Natures…

                 OLTP                ODS                    OLAP            DM/DW
Business Focus   Operational         Operational, Tactical  Tactical        Tactical, Strategic
End User Tools   Client Server, Web  Client Server, Web     Client Server   Client Server, Web
DB Technology    Relational          Relational             Cubic           Relational
Trans Count      Large               Medium                 Small           Small
Trans Size       Small               Medium                 Medium          Large
Trans Time       Short               Medium                 Long            Long
Size in Gigs     10 – 200            50 – 400               50 – 400        400 – 4000
Normalization    3NF                 3NF                    N/A             0NF
Data Modeling    Traditional ER      Traditional ER         N/A             Dimensional

Page 12:

Embrace New Concepts

“Teach Old Dog New Tricks”

Throw out any OLTP baggage

Forget OLTP “Golden Rules”

Page 13:

Star Schema Design

“Star schema” approach to dimensional data modeling was pioneered by Ralph Kimball

Dimensions: smaller, de-normalized tables containing business descriptive columns that end-users query on

Facts: very large tables with primary keys formed from the concatenation of related dimension table foreign key columns, and possessing numerically additive, non-key columns used for calculations during end-user queries
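The two definitions above can be sketched in Oracle DDL. This is a minimal sketch, not the author's schema: the table names dw_location and pos_day and every column name are hypothetical, chosen only for illustration (the deck itself names just dw_period and dw_product).

```sql
-- Dimension: small, de-normalized, descriptive columns that end-users query on.
-- dw_location and all of its columns are hypothetical names.
CREATE TABLE dw_location (
  location_id  NUMBER        NOT NULL,  -- surrogate key
  levelx       VARCHAR2(20)  NOT NULL,  -- hierarchy level of this row
  region_name  VARCHAR2(30),
  city_name    VARCHAR2(30),
  store_name   VARCHAR2(30)
);

-- Fact: very large table whose key is the concatenation of the
-- dimension foreign keys (period_id, location_id, product_id).
-- dw_period and dw_product would follow the same pattern as dw_location.
CREATE TABLE pos_day (
  period_id     NUMBER        NOT NULL,  -- points at dw_period
  location_id   NUMBER        NOT NULL,  -- points at dw_location
  product_id    NUMBER        NOT NULL,  -- points at dw_product
  sales_unit    NUMBER        NOT NULL,  -- additive measure
  sales_retail  NUMBER(12,2)  NOT NULL   -- additive measure
);
```

The measures are deliberately plain additive numbers, so end-user queries can SUM them across any combination of dimensions.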

Page 14:

Dimensions

Facts

Page 15:

10^8 to 10^10

10^3 to 10^5

Page 16:

Transform OLTP Model

Fold OLTP model into itself to form a Star:

De-Normalize parent/child relationships

De-Normalize lookup relationships

Use surrogate or meaningless keys

Create and populate a time dimension

Create hierarchies of data in dimensions
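Two of the steps above, surrogate keys and a populated time dimension, can be sketched together. This is only an illustration under stated assumptions: the column names are hypothetical, the year 1998 is picked arbitrarily, and only DAY-level rows are generated.

```sql
-- Hypothetical layout for the time dimension; only the table name
-- dw_period and the levelx column come from the deck.
CREATE TABLE dw_period (
  period_id       NUMBER        NOT NULL,  -- surrogate (meaningless) key
  levelx          VARCHAR2(20)  NOT NULL,  -- hierarchy level, e.g. DAY
  period_date     DATE          NOT NULL,
  period_month    NUMBER(2)     NOT NULL,
  period_quarter  NUMBER(1)     NOT NULL,
  period_year     NUMBER(4)     NOT NULL
);

-- Generate one DAY row per calendar day of 1998 (365 days).
-- The row generator assigns period_id 1..365 in date order.
INSERT INTO dw_period
  (period_id, levelx, period_date, period_month, period_quarter, period_year)
SELECT LEVEL,
       'DAY',
       DATE '1998-01-01' + LEVEL - 1,
       TO_NUMBER(TO_CHAR(DATE '1998-01-01' + LEVEL - 1, 'MM')),
       TO_NUMBER(TO_CHAR(DATE '1998-01-01' + LEVEL - 1, 'Q')),
       TO_NUMBER(TO_CHAR(DATE '1998-01-01' + LEVEL - 1, 'YYYY'))
FROM   dual
CONNECT BY LEVEL <= 365;

COMMIT;
```

Because the key carries no business meaning, a fact row stores only period_id, and every calendar attribute is looked up in the dimension.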

Page 17:

OLTP Model

Page 18:

Dimensional Model

Page 19:

Dimension Hierarchies

SQL> select distinct levelx from dw_period;

LEVELX
--------------------
DAY
MONTH
QUARTER
WEEK
YEAR

SQL> select distinct levelx from dw_product;

LEVELX
--------------------
ALL PRODUCTS
CATEGORY
ITEM
PSA
SUB_CATEGORY
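One plausible reading of the output above is that a single dimension table holds rows for every level of its hierarchy, tagged by the levelx column. The sketch below illustrates that reading only; the column names other than levelx and all of the sample values are hypothetical.

```sql
-- Hypothetical layout: one table, one row per member at every hierarchy level.
CREATE TABLE dw_product (
  product_id         NUMBER        NOT NULL,  -- surrogate key
  levelx             VARCHAR2(20)  NOT NULL,  -- level this row represents
  category_name      VARCHAR2(30),
  sub_category_name  VARCHAR2(30),
  item_name          VARCHAR2(30)
);

-- Illustrative rows: higher levels leave the lower-level columns NULL.
INSERT INTO dw_product VALUES (1, 'ALL PRODUCTS', NULL,        NULL,   NULL);
INSERT INTO dw_product VALUES (2, 'CATEGORY',     'BEVERAGES', NULL,   NULL);
INSERT INTO dw_product VALUES (3, 'SUB_CATEGORY', 'BEVERAGES', 'BEER', NULL);
INSERT INTO dw_product VALUES (4, 'ITEM',         'BEVERAGES', 'BEER', 'LAGER 6-PACK');

COMMIT;

-- Count the members stored at each level.
SELECT levelx, COUNT(*) AS row_count
FROM   dw_product
GROUP  BY levelx
ORDER  BY levelx;
```

With this layout a detail fact row would point at an ITEM row, while an aggregate table could point at CATEGORY or SUB_CATEGORY rows of the same dimension, without adding a snowflaked lookup table.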

Page 20:

Avoid Snowflakes

Avoid natural desire to normalize model:

Complicates end-user query construction

Adds additional level of “JOIN” complexity

Database optimizers do not handle snowflakes very well

Saves some space at the cost of longer queries

Page 21:

Snowflake Model

Page 22:

Common Aggregations

Build end-user driven aggregate tables:

By time (e.g. week, month, quarter, year)

By geographic regions (e.g. time zones)

By end-user reporting interests (e.g. beer)

By dimension hierarchy (e.g. product category)

Aggregates should be 5 to 10 times smaller
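A time-based aggregate of the kind listed above might be built as follows. This is a sketch under assumptions: it presumes a hypothetical detail fact table pos_day (period_id, location_id, product_id, sales_unit, sales_retail) and a dw_period dimension with a period_date column already exist, and pos_month is an invented name.

```sql
-- Monthly aggregate of the hypothetical daily fact table.
CREATE TABLE pos_month NOLOGGING AS
SELECT TRUNC(per.period_date, 'MM')  AS month_start,
       fact.location_id,
       fact.product_id,
       SUM(fact.sales_unit)          AS sales_unit,
       SUM(fact.sales_retail)        AS sales_retail
FROM   pos_day fact, dw_period per
WHERE  fact.period_id = per.period_id
GROUP  BY TRUNC(per.period_date, 'MM'),
          fact.location_id,
          fact.product_id;

-- Check the size guideline: detail rows divided by aggregate rows.
SELECT (SELECT COUNT(*) FROM pos_day)
       / NULLIF((SELECT COUNT(*) FROM pos_month), 0) AS shrink_factor
FROM   dual;
```

If shrink_factor comes out well below the 5 to 10 range, the aggregate may not save enough work to justify its maintenance.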

Page 23:

Time Aggregates

Page 24:

Non-Time Aggregates

Page 25:

Index Design

One Very Simple Rule:

All fact table foreign key columns must have individual bitmap indexes on them

All dimension table columns should each have individual bitmap indexes
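The rule above translates into one single-column bitmap index per column. A minimal sketch, assuming a hypothetical non-partitioned fact table pos_day and a dimension dw_location with the columns shown (all names are illustrative); bitmap indexes also require an Oracle edition that supports them.

```sql
-- One bitmap index per fact foreign key column.
CREATE BITMAP INDEX pos_day_b1 ON pos_day (period_id)   NOLOGGING;
CREATE BITMAP INDEX pos_day_b2 ON pos_day (location_id) NOLOGGING;
CREATE BITMAP INDEX pos_day_b3 ON pos_day (product_id)  NOLOGGING;

-- One bitmap index per dimension column that end-users filter on.
CREATE BITMAP INDEX dw_location_b1 ON dw_location (levelx)      NOLOGGING;
CREATE BITMAP INDEX dw_location_b2 ON dw_location (region_name) NOLOGGING;
CREATE BITMAP INDEX dw_location_b3 ON dw_location (city_name)   NOLOGGING;
CREATE BITMAP INDEX dw_location_b4 ON dw_location (store_name)  NOLOGGING;
```

Keeping each index on a single column lets the optimizer combine whichever bitmaps an ad-hoc query happens to need. On a partitioned fact table the same statements would need the LOCAL keyword.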

Page 26:

Nighttime - 10 B-Tree Indexes

Page 27:

Daytime - 48 Bitmap Indexes!!!

Page 28:

Bit-map indexes
– Contrary to widespread belief, can be effective when there are many distinct column values
– Not suitable for OLTP however

[Figure: elapsed time (s, log scale from 0.01 to 100) versus number of distinct values (1 to 1,000,000) for bitmap index, B*-Tree index, and full table scan]

Page 29:

Key Fact Table Issues

Fact tables should:
NOT create or enable foreign key constraints (exception – MV’s need FK’s for query rewrites)
NOT create or enable table check constraints
NOT create or enable primary/unique constraints (use unique indexes which offer parallel creation)
NOT create or enable column check constraints (other than simple NOT NULL check constraints)
NOT create or enable “row” level triggers
NOT enable logging on tables or their indexes
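Some of these recommendations can be expressed directly in DDL. The sketch below assumes a hypothetical fact table pos_day already exists with NOT NULL key columns; the index name and the parallel degree of 4 are arbitrary illustrations, not recommendations from the deck.

```sql
-- Turn off redo logging for the fact table.
ALTER TABLE pos_day NOLOGGING;

-- Enforce uniqueness with a unique index instead of a PK/UK constraint,
-- built in parallel and without logging.
CREATE UNIQUE INDEX pos_day_uk
  ON pos_day (period_id, location_id, product_id)
  PARALLEL 4 NOLOGGING;

-- Verify what constraints remain: ideally only the NOT NULL checks
-- (constraint_type 'C') are listed.
SELECT constraint_name, constraint_type, status
FROM   user_constraints
WHERE  table_name = 'POS_DAY';
```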

Page 30:

No PK/UK/FK Constraints

Page 31:

Key Oracle Issues …

Trust me – no way to build a large DW/DM in Oracle 7.X (don’t recommend 8.X either)

Very brief overview in next few slides of:
– Partitioning options
– Indexing options
– Comparative timings
– Tuning ad-hoc Star queries
– Serial versus Parallel queries
– Materialized Views …

Page 32:

Oracle Partitioning

Way beyond the scope of dimensional modeling, but:

Use Range or List Partitioning using time dimension
Fact unique index = local, prefixed b-tree index
Fact time index = local, prefixed bitmap index
Fact non-time index = local, non-prefixed bitmap index
If any non-time dimension provides a good locality of reference for typical user queries, then sub-partition on that dimension (i.e. composite partitioning) – but note that under non-ideal data distributions, things could be worse or sometimes even much worse…
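The first four recommendations can be sketched as follows. Everything named here is hypothetical: the table pos_day_part, its columns, and the partition bounds, which assume period_id values 1 to 365 were assigned to the days of 1998 in date order so that key ranges line up with calendar quarters.

```sql
-- Range partitioning on the time dimension key.
CREATE TABLE pos_day_part (
  period_id     NUMBER        NOT NULL,
  location_id   NUMBER        NOT NULL,
  product_id    NUMBER        NOT NULL,
  sales_unit    NUMBER        NOT NULL,
  sales_retail  NUMBER(12,2)  NOT NULL
)
PARTITION BY RANGE (period_id) (
  PARTITION p_1998_q1 VALUES LESS THAN (91),   -- days   1 ..  90
  PARTITION p_1998_q2 VALUES LESS THAN (182),  -- days  91 .. 181
  PARTITION p_1998_q3 VALUES LESS THAN (274),  -- days 182 .. 273
  PARTITION p_1998_q4 VALUES LESS THAN (366)   -- days 274 .. 365
);

-- Fact unique index: local, prefixed b-tree (leads with the partition key).
CREATE UNIQUE INDEX pos_day_part_pk
  ON pos_day_part (period_id, location_id, product_id) LOCAL;

-- Fact time index: local, prefixed bitmap.
CREATE BITMAP INDEX pos_day_part_b1 ON pos_day_part (period_id) LOCAL;

-- Fact non-time indexes: local, non-prefixed bitmap.
CREATE BITMAP INDEX pos_day_part_b2 ON pos_day_part (location_id) LOCAL;
CREATE BITMAP INDEX pos_day_part_b3 ON pos_day_part (product_id)  LOCAL;
```

Partitioning on a surrogate key only works when key order follows time order, which is why the bounds above depend on how the time dimension was populated.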

Page 33:

Indexing Options …

[Figure: tree of Oracle table and index options. Branches: relational object; table in cluster or table in tablespace; index or heap organization; partitioned or non-partitioned table; cluster or non-cluster index; partitioned or non-partitioned index; global or local. The leaves enumerate 12 index choices: b-tree (1, 2, 4, 6, 7, 9, 10, 12) and bitmap (3, 5, 8, 11)]

Page 34:

Query Time vs. Table Design

NOTE: specific to my data and user queries

Fact Implementation         Timing
Regular “Heap” Table         9,293
Single Column Partition      4,747
Multi Column Partition       4,987
Composite Partition          6,319
Index Organized Table       12,508
Partition Index Organized   14,902

Page 35:

Tuning Star Queries …

Way beyond the scope of dimensional modeling, but:

Use Range Partitioning based upon your time dimension (do not try to force use of hash or composite partitioning)

Fact unique index uses local, prefixed b-tree index

Fact time index uses local, prefixed bitmap index

Fact non-time index uses local, non-prefixed bitmap index

Page 36:

Example BI Generated Query

Query: beer and coffee sales for November of 98 in Dallas
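The generated SQL shown on the slide did not survive in this transcript. As a rough illustration only, a star query for that question might look like the following; the table names pos_day and dw_location, every column name, and the literal values are assumptions, not the author's actual query.

```sql
-- Hypothetical star query: beer and coffee sales for November 1998 in Dallas.
-- Assumes the fact table pos_day and the dimensions dw_period, dw_location,
-- and dw_product exist with the columns referenced below.
SELECT prod.sub_category_name,
       SUM(fact.sales_unit)    AS total_units,
       SUM(fact.sales_retail)  AS total_retail
FROM   pos_day     fact,
       dw_period   per,
       dw_location loc,
       dw_product  prod
WHERE  fact.period_id   = per.period_id
AND    fact.location_id = loc.location_id
AND    fact.product_id  = prod.product_id
AND    per.period_year  = 1998
AND    per.period_month = 11
AND    loc.city_name    = 'DALLAS'
AND    prod.sub_category_name IN ('BEER', 'COFFEE')
GROUP  BY prod.sub_category_name;
```

The shape is typical of star queries: one large fact table joined to several small dimensions, with all of the filtering done on dimension columns.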

Page 37:

Star Transformation

Star Transformation Explain
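The plan output from the slide is not in this transcript. A hedged sketch of how one might enable star transformation and inspect the resulting plan follows; it assumes hypothetical tables pos_day, dw_period, and dw_location with bitmap indexes on the fact foreign key columns, and a release that ships DBMS_XPLAN.

```sql
-- Session-level switch; bitmap indexes on the fact foreign keys are a prerequisite.
ALTER SESSION SET star_transformation_enabled = TRUE;

-- Ask the optimizer for a plan, hinting at star transformation.
EXPLAIN PLAN FOR
SELECT /*+ STAR_TRANSFORMATION */
       SUM(fact.sales_retail) AS total_retail
FROM   pos_day fact, dw_period per, dw_location loc
WHERE  fact.period_id   = per.period_id
AND    fact.location_id = loc.location_id
AND    per.period_year  = 1998
AND    per.period_month = 11
AND    loc.city_name    = 'DALLAS';

-- Display the plan just generated.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

When the transformation is chosen, the plan typically shows bitmap operations against the fact table indexes before the fact rows themselves are visited. The hint is a request, so the optimizer may still pick another plan.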

Page 38:

Star join performance

Plan                    Elapsed time (s)
STAR hint                           0.06
Cost Based (no STAR)                3.43
Rule Based                         59.86

3 orders of magnitude difference between best and worst plan

Page 39:

Query Time vs. Serial/Parallel

NOTE: specific to my data and user queries

Explain Plan               UNIX      NT
Serial, No Partition       9,688     22,344
Serial, with Partition     5,578     11,625
Parallel, No Partition     ORA-600   ORA-600
Parallel, with Partition   11,140    25,454

Page 40:

Oracle Materialized Views

Way beyond the scope of dimensional modeling, but:

Special form of snapshots (i.e. replication)

End-users direct all queries against detail table

Optimizer rewrites queries to use best aggregate

Optimizer suggests new aggregates based on load

Eliminates need for numerous aggregation programs
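As an illustration of the points above, a monthly aggregate could be declared as a materialized view that the optimizer may substitute for the detail table. This is a sketch with hypothetical names (pos_month_mv, pos_day, and the columns shown); query rewrite also depends on privileges and on an edition that supports it.

```sql
-- Aggregate maintained by Oracle rather than by a hand-written program.
CREATE MATERIALIZED VIEW pos_month_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT per.period_year,
       per.period_month,
       fact.location_id,
       fact.product_id,
       SUM(fact.sales_unit)    AS sales_unit,
       SUM(fact.sales_retail)  AS sales_retail
FROM   pos_day fact, dw_period per
WHERE  fact.period_id = per.period_id
GROUP  BY per.period_year,
          per.period_month,
          fact.location_id,
          fact.product_id;

-- Allow the optimizer to rewrite detail queries against the MV.
ALTER SESSION SET query_rewrite_enabled = TRUE;
```

End-users keep writing queries against pos_day and dw_period; when a query groups by month, the optimizer can answer it from pos_month_mv instead.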

Page 41:

Exercise caution when creating materialized views

[Figure: logical IO (axis from 0 to 300,000) with materialized view & MV log versus without materialized view, broken down into insert into sales, maintain MV log, and update MV]

Conclusion: Better to rebuild MV after load – not concurrent with load
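The conclusion above suggests a load-then-refresh sequence. A minimal sketch, assuming a hypothetical materialized view named POS_MONTH_MV that was created with ON DEMAND refresh and has no materialized view log on its base table:

```sql
-- Step 1 (not shown): finish the bulk load into the detail fact table.

-- Step 2: rebuild the materialized view once, after the load completes.
-- Method 'C' requests a complete refresh.
BEGIN
  DBMS_MVIEW.REFRESH(list => 'POS_MONTH_MV', method => 'C');
END;
/
```

Refreshing once at the end avoids paying the MV log and MV update cost on every inserted row.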

Page 42:

Parting Thoughts …

To be successful, every modeler’s mindset must change from an OLTP to a DW/DM paradigm

There are many other key/core data modeling issues – this was just one of them …
– Breaking models into sub-models
– Repository-based collaborative modeling
– Modeling the relationships between OLTP and DW models
– Documenting the meta-data for OLTP ETL transformations
– Modeling the Business Requirements
– Object-Relational Mapping
– etc, etc, etc …