IRM UK - 2009: DV Modeling And Methodology


Upload: empowered-holdings-llc

Post on 19-Jan-2015


DESCRIPTION

This was a presentation I gave at the IRM UK conference in November 2009. It covers some interesting details around the steps you should take to build your Data Vault, and an overview of why re-engineering creeps into your existing silo solutions.

TRANSCRIPT

Page 1: IRM UK - 2009: DV Modeling And Methodology

04/10/2023LearnDataVault.com 1

Page 2: IRM UK - 2009: DV Modeling And Methodology

Data Vault Modeling

Methodology

A Primer…

© Dan Linstedt 2009-2012. All Rights Reserved

http://LearnDataVault.com

Page 3: IRM UK - 2009: DV Modeling And Methodology


A bit about me…

• Author, Inventor, Speaker, and part time photographer…
• 25+ years in the IT industry
• Worked in DoD, US Gov’t, Fortune 50, and so on…
• Find out more about the Data Vault:
  o http://www.youtube.com/LearnDataVault
  o http://LearnDataVault.com
• Full profile on http://www.LinkedIn.com/dlinstedt

Page 4: IRM UK - 2009: DV Modeling And Methodology


What IS a Data Vault? (Business Definition)

• Data Vault Model
  o Detail oriented
  o Historical traceability
  o Uniquely linked set of normalized tables
  o Supports one or more functional areas of business

[Diagram: functional areas (Procurement, Sales, Delivery, Contracts, Finance, Planning, Operations). Business Keys Span / Cross Lines of Business.]

• Data Vault Methodology
  o CMMI Level 5 Project Plan
  o Risk, Governance, Versioning
  o Peer Reviews, Release Cycles
  o Repeatable, Consistent, Optimized
  o Complete with Best Practices for BI/DW

Page 5: IRM UK - 2009: DV Modeling And Methodology


What Does One Look Like?

[Diagram: three Hubs (Customer, Product, Order), each labelled F(x) and each with its own Satellites, joined by a Link that has its own Satellite. The Link records a history of the interaction.]

Elements:
• Hub = List of Unique Business Keys
• Link = List of Relationships, Associations
• Satellites = Descriptive Data
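The three element types can be pictured as plain record types. The sketch below is only an illustration of the definitions on this slide, not the author's physical design: the field names `load_date` and `record_source`, and the sample keys, are assumptions added for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Hub:
    """One row per unique business key."""
    business_key: str
    load_date: datetime      # assumed audit field, not from the slide
    record_source: str       # assumed audit field, not from the slide


@dataclass(frozen=True)
class Link:
    """One row per relationship / association between hub keys."""
    hub_keys: tuple          # business keys of the hubs being related
    load_date: datetime
    record_source: str


@dataclass(frozen=True)
class Satellite:
    """Descriptive data hanging off exactly one hub key or one link."""
    parent_key: object       # a hub business key, or a link's hub_keys tuple
    load_date: datetime
    attributes: tuple        # (name, value) pairs describing the parent


# Hypothetical sample rows.
loaded = datetime(2009, 11, 1)
customer = Hub("CUST-001", loaded, "sales_system")
order = Hub("ORD-900", loaded, "sales_system")
customer_order = Link((customer.business_key, order.business_key), loaded, "sales_system")
customer_name = Satellite(customer.business_key, loaded, (("name", "Acme Ltd"),))
```

Note that the Hub carries no descriptive columns and the Satellite carries no relationships; that separation is what the rest of the deck relies on.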

Page 6: IRM UK - 2009: DV Modeling And Methodology

Who’s Using It?


Page 7: IRM UK - 2009: DV Modeling And Methodology

The PAIN!!

Issues in Current EDW Projects

Page 8: IRM UK - 2009: DV Modeling And Methodology


EDW Architecture: Generation 1

[Diagram: Sales, Finance, and Contracts sources feed Staging (EDW) in batch, which feeds Star Schemas and the Enterprise BI Solution. Labels: Staging + History; Complex Business Rules; Complex Business Rules + Dependencies; Conformed Dimensions; Junk Tables; Helper Tables; Factless Facts.]

Page 9: IRM UK - 2009: DV Modeling And Methodology


Kick-Starting Data Warehousing

1. HR Asks IT to build the FIRST Data Warehouse / Prototype

2. IT Says… OK: $125k and 90 days…

3. HR Says: Great! Get Started

Page 10: IRM UK - 2009: DV Modeling And Methodology


Everyone’s Happy!

4. IT Delivers. On-Time & In Budget!

[Diagram: First Star! A star schema with one central fact table (Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Fact_ABC, Fact_DEF, Fact_PDQ, Fact_MYFACT) surrounded by four dimension tables, each with the columns Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Customer_Tag, Customer_Score, Customer_Region, Customer_Stats, Customer_Type.]

5. HR Says: Thank-you! We’re Happy!

Page 11: IRM UK - 2009: DV Modeling And Methodology

So Where’s the PAIN?


Page 12: IRM UK - 2009: DV Modeling And Methodology


The PAIN is RIGHT HERE!!

1. Contracts sees success, wants the same for their systems.

2. IT Says… Ok, but… It won’t be $125k and 90 days… Because we have to “merge it” with HR, it will be $250k and 180 days.

3. Contracts Says: Ouch! That’s not reasonable, but we need it, so go ahead…

Page 13: IRM UK - 2009: DV Modeling And Methodology


And HERE….

Finance, Sales, and Marketing want in….

IT Says… Ok, but… It won’t be $250k and 180 days… Because we have to “merge it” with HR and Contracts, it will be $350k and 250 days.

And this continues….

Business Says... “Can’t you just make a copy of the Star Schema, and give me my own for cheaper & less time?”

Page 14: IRM UK - 2009: DV Modeling And Methodology


Silo Building / IT Non-Agility

• SALES: We built our own because IT costs too much
• FINANCE: We built our own because IT took too long
• MARKETING: We built our own because we need customized dimension data

[Diagram: the departmental silos sit alongside the First Star.]

Why is this happening? What’s causing this problem?

Page 15: IRM UK - 2009: DV Modeling And Methodology

Root Cause of Pain: Re-Engineering!


IT is forced to Re-Engineer ETL loading code + SQL BI Queries WHENEVER:
• Table structures change
• New systems are introduced
• Business Rules change (causing ETL Loading to change, and forcing Engineers to RELOAD existing data)

1. Adding fields to Dimensions

2. Adding fields to Facts

3. Adding Dimensions to Facts

[Diagram: the First Star fact table (Customer_ID through Customer_Phone, plus Fact_ABC, Fact_DEF, Fact_PDQ, Fact_MYFACT) surrounded by five dimension tables (Customer_ID through Customer_Type), annotated with the three kinds of change above.]

Page 16: IRM UK - 2009: DV Modeling And Methodology

Why Re-Engineering?


• Adding fields to a conformed dimension….
• Adding fields to a shared fact….
• Changing code to match new business rules…

Require adding/changing fields in target tables!

Require Re-Engineering!

Page 17: IRM UK - 2009: DV Modeling And Methodology

Other Pains?


Dimension-Itis?

Deformed Dimensions?

IT – Non-Agility?

WHAT ABOUT THE “DATA” YOU DON’T SEE?

WHAT ABOUT THE “BAD” DATA LEFT IN THE SOURCE SYSTEMS?

Page 18: IRM UK - 2009: DV Modeling And Methodology

The Solution

Go the Data Vault Route!

Page 19: IRM UK - 2009: DV Modeling And Methodology


EDW Architecture: Generation 2

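[Diagram: Sales, Finance, and Contracts sources reach the DV EDW either through Staging (batch) or through SOA (real-time). The EDW feeds Star Schemas, Error Marts, and Report Collections (batch) for the Enterprise BI Solution. Business Rules sit Downstream of the EDW (the Lens Filter).]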

Page 20: IRM UK - 2009: DV Modeling And Methodology


Unstructured Data And Data Vault

[Diagram: Unstructured Data Sets (Email, Docs, Images, Movies, Sound) pass through an Unstructured Processing Engine that uses Ontologies/Taxonomies. The results join through LINK Structures to the Data Vault EDW, which feeds On-Demand Cubes.]

Page 21: IRM UK - 2009: DV Modeling And Methodology


IT Agility

[Diagram: Source to Staging to Data Vault (EDW), which holds the RAW “what-is” data and feeds Star Schemas. Complex Business Rules (ETL-T) are applied after the Data Vault to produce Business Driven Star Schemas.]

1. Fast Load & Fast Integration

2. Business Gap Analysis
• Unknown Time…
• Business Requirements
• Start new phase

3. IT Implementation of Business Rules

Page 22: IRM UK - 2009: DV Modeling And Methodology

What are the Facts, Jack?

Generation 1 EDWs tried to provide “One version of the truth”

Generation 2 (Data Vaults) provide… “One version of the facts, for each point in time.”
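One way to read “one version of the facts, for each point in time” is that descriptive rows are only ever appended with a load date, so any past state can be looked up again. The sketch below illustrates that reading under assumed names: `history`, `facts_as_of`, and the sample city values are hypothetical and not part of the presentation.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical satellite history for one customer: (load_date, attributes),
# kept sorted by load_date. Rows are appended, never updated in place.
history = [
    (date(2009, 1, 1), {"city": "London"}),
    (date(2009, 6, 1), {"city": "Leeds"}),
    (date(2009, 11, 1), {"city": "York"}),
]


def facts_as_of(rows, as_of):
    """Return the attributes in effect on `as_of`, or None if nothing was loaded yet."""
    load_dates = [load_date for load_date, _ in rows]
    position = bisect_right(load_dates, as_of)   # number of rows loaded on or before as_of
    return rows[position - 1][1] if position else None
```

Calling `facts_as_of(history, date(2009, 7, 15))` returns the row loaded on 1 June, because that was the latest fact known at that point in time.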

Page 23: IRM UK - 2009: DV Modeling And Methodology


Business Gap Analysis

Gap Analysis compares:
• The way the business perceives its business to be running
• The way the source systems see the business running

[Diagram labels: Operational Reports; Dynamic Cubes (Data Marts)]

Page 24: IRM UK - 2009: DV Modeling And Methodology


Secured/Protected Information Systems

[Diagram: a Non-Classified DV (Hubs, Links, Satellites) is copied, as a Model Copy and a Data Copy, into a Classified Data Vault, which adds further Hub, Link, and Satellite tables of its own. Yellow = New Tables.]

• Model changes are absorbed seamlessly into the classified system
• Classified world can add all their own structures while maintaining congruence with standard unclassified Data Vault

Page 25: IRM UK - 2009: DV Modeling And Methodology


Extensibility Factor

[Diagram: a Product Supplier Link (Product Shipped Dates, Billed Amounts, Product Quantities) joins Suppliers (Descriptions, Stock Quantities, Address, Rating Score) and Products (Descriptions, Stock Quantities, Availability Dates, Defect Reasons). Callouts: Existing EDW, No Impact! New Additions, New Code.]

Page 26: IRM UK - 2009: DV Modeling And Methodology

Where’s the Solution?


Handle Changes Wherever… Whenever… with EASE!

Re-Engineering

Page 27: IRM UK - 2009: DV Modeling And Methodology

The Three Vehicles…

Pros and Cons of the Modeling Methodologies

Page 28: IRM UK - 2009: DV Modeling And Methodology


3rd Normal Form Pros/Cons as an EDW

PROS (as 3NF)
• Many to many linkages
• Handle lots of information
• Tightly integrated information
• Highly structured
• Conducive to near-real time loads
• Relatively easy to extend

CONS (as EDW)
• Time driven PK issues
• Parent-child complexities
• Cascading change impacts
• Difficult to load
• Not conducive to BI tools
• Not conducive to drill-down
• Difficult to architect for an enterprise
• Not conducive to spiral/scope controlled implementation
• Physical design usually doesn’t follow business processes

Page 29: IRM UK - 2009: DV Modeling And Methodology


Star Schema Pros/Cons as an EDW

PROS (as Data Mart)
• Good for multi-dimensional analysis
• Subject oriented answers
• Excellent for aggregation points
• Rapid development / deployment
• Great for some historical storage

CONS (as EDW)
• Not cross-business functional
• Use of junk / helper tables
• Trouble with VLDW
• Unable to provide integrated enterprise information
• Can’t handle ODS or exploration warehouse requirements
• Trouble with data explosion in near-real-time environments
• Trouble with updates to type 2 dimension primary keys
• Trouble with late arriving data in dimensions to support real-time arriving transactions
• Not granular enough information to support real-time data integration

Page 30: IRM UK - 2009: DV Modeling And Methodology


Data Vault Pros/Cons as an EDW

PROS (as EDW)
• Supports near-real time and batch feeds
• Supports functional business linking
• Extensible / flexible
• Provides rapid build / delivery of star schemas
• Supports VLDB / VLDW
• Designed for EDW
• Supports data mining and AI
• Provides granular detail
• Incrementally built

CONS (as EDW)
• Not conducive to OLAP processing
• Requires business analysis to be firm
• Introduces many join operations

Page 31: IRM UK - 2009: DV Modeling And Methodology


The Three Vehicles…

• Which would you use to win a race?
• Which would you use to move a house?
• Would you adapt the truck and enter a race with Porsches and expect to win?

Page 32: IRM UK - 2009: DV Modeling And Methodology

#1 complaint about DV architecture

So you want to deal with Joins do you?


Page 33: IRM UK - 2009: DV Modeling And Methodology

Joins, Everywhere!


Yes, the DV is full of joins, but…

These are highly normalized tables (thin & narrow), which reduces the I/Os needed to read large numbers of rows, at high speed, in parallel. Joins occur in RAM instead of on disk. The Optimizer is given a chance to “drop tables” from the join that aren’t necessary.

When Parallelism is too much…
• Not enough CPU or RAM to handle the extra work-load
• Not enough rows being queried (the overhead of starting the threads takes longer than the original scan)

End Result? The DV Scales to the Petabyte Levels when necessary…

Page 34: IRM UK - 2009: DV Modeling And Methodology

Mathematics Behind the Data Vault Model

*** The Data Vault is BACKED by Mathematical Principles***

• Parallel versus sequential execution models
• Set Logic
• I/O Bandwidth & Throughput
• Compression (for query performance gains)
• Process Repeatability (tuning & predictability measurements)
• RAM versus electromagnetic disk (Solid-State Drives are not measured)

• http://osl.cs.uiuc.edu/docs/IPDPS-TR04/TCA_TR04.pdf

Page 35: IRM UK - 2009: DV Modeling And Methodology

Know when to hold ‘em, know when to fold ‘em

When to use DV, and when not…


Page 36: IRM UK - 2009: DV Modeling And Methodology

The Challenger….


The challenger says:
• My system works fine, why should I use the Data Vault?
• I don’t have volume problems…
• I don’t have compliance/auditability problems…
• I don’t have real-time problems…
• My system produces matching results across lines of business…
• I’ve never had to “re-state” the data in the warehouse…
• I can still build new marts, and conform dimensions in 30 days or less…
• My business doesn’t acquire new systems often (if ever)
• My incoming data sets don’t change

I Say… That’s wonderful, don’t fix what isn’t broken. Have a nice day. Oh, but call me when or if you ever run into these problems…

Page 37: IRM UK - 2009: DV Modeling And Methodology

When to Apply the Data Vault

Benefits:
• Scalability
• Auditability
• Flexibility
• Adaptability

Leads To…
• IT Agility
• IT and Business Accountability
• Reduction in Spread-Marts
• Corporate Asset Development
• Money Savings
• Risk Mitigation
• Successful EDW Implementations

Page 38: IRM UK - 2009: DV Modeling And Methodology

How to build a data vault

In 10 easy steps…


Page 39: IRM UK - 2009: DV Modeling And Methodology

Step 1


Identify your business processes, followed by your business keys (that are used to identify the data that flows through the business processes)

** NOTE: Along the way, document your assumptions, document your reasons for choosing keys and modeling designs, and develop a list of questions to be answered by business users…

Page 40: IRM UK - 2009: DV Modeling And Methodology

Step 2


Identify the issues/problems that might be carried with the identified business keys, annotate the risks, and mitigate each one.

Page 41: IRM UK - 2009: DV Modeling And Methodology

Step 3


Identify the units of work, the associations – LINK tables, where keys combine to form a notion, a concept, and a relationship.

Page 42: IRM UK - 2009: DV Modeling And Methodology

Step 4

Identify the descriptive data that belongs to SINGLE Hub Keys, and ensure that the data doesn’t represent or rely on a relationship.

Page 43: IRM UK - 2009: DV Modeling And Methodology

Step 5


Identify the Satellite data that depends on relationships – move it to the appropriate LINK table.

HINT: If you “want” to put a Foreign Key in a Satellite, you have a clear sign that the Satellite is in the WRONG place, and needs to be assigned to a LINK table rather than a HUB.
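The placement rule from Steps 4 and 5 can be written down as a small decision function. This is only a sketch of the rule as stated in the hint: `satellite_parent` and the sample key names are hypothetical, and it assumes you already know which hub keys a piece of descriptive data depends on.

```python
def satellite_parent(depends_on):
    """Pick the parent of a Satellite from the hub keys its data depends on.

    One hub key: the data describes that Hub. More than one: the data
    describes a relationship, so it belongs on the LINK between those Hubs.
    """
    keys = tuple(sorted(set(depends_on)))
    if not keys:
        raise ValueError("descriptive data must depend on at least one hub key")
    if len(keys) == 1:
        return ("HUB", keys)
    return ("LINK", keys)
```

For example, a customer's name depends only on the customer key and stays on the Hub, while a quantity that needs both an order key and a product key (the “foreign key in a Satellite” symptom) resolves to a LINK.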

Page 44: IRM UK - 2009: DV Modeling And Methodology

Step 6

Scope the Model down to a manageable chunk. Implement the first two Hubs, Hub Satellites, and first Link. BUILD IN INCREMENTS!

Page 45: IRM UK - 2009: DV Modeling And Methodology

Step 7

Set up the key generation load routines, set up the staging area, and begin loading data.
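A key generation routine can be as simple as handing out one sequence number per distinct business key. The sketch below assumes sequence-number surrogate keys and assumes that keys are trimmed and upper-cased before comparison; neither detail is stated on the slide, and `KeyGenerator` is a hypothetical name.

```python
from itertools import count


class KeyGenerator:
    """Hands out one surrogate sequence number per distinct business key."""

    def __init__(self):
        self._next = count(1)   # sequence starts at 1
        self._keys = {}         # normalized business key -> surrogate key

    def key_for(self, business_key):
        # Assumed normalization: ignore surrounding whitespace and letter case.
        normalized = business_key.strip().upper()
        if normalized not in self._keys:
            self._keys[normalized] = next(self._next)
        return self._keys[normalized]
```

Asking for the same business key twice returns the same number, which is what lets Hubs, Links, and Satellites be loaded independently and still line up.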

Page 46: IRM UK - 2009: DV Modeling And Methodology

Step 8


Review any “truncation” errors, or any data-type conversion problems, fix the staging area, and remove duplicates.
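The staging review in this step can be partly automated. The sketch below is one possible check, under assumptions: staged rows are dictionaries of column name to value, `max_lengths` holds the target column widths, and both function names are hypothetical.

```python
def find_truncation_risks(rows, max_lengths):
    """Return (row_index, column, length) for every value longer than its target column."""
    return [
        (index, column, len(str(row[column])))
        for index, row in enumerate(rows)
        for column, limit in max_lengths.items()
        if column in row and len(str(row[column])) > limit
    ]


def remove_duplicates(rows):
    """Drop exact duplicate rows, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))   # order-independent row identity
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# Hypothetical staged rows.
staged = [
    {"customer_id": "C1", "name": "Acme"},
    {"customer_id": "C1", "name": "Acme"},
    {"customer_id": "C2", "name": "A very long customer name"},
]
risks = find_truncation_risks(staged, {"name": 10})
clean = remove_duplicates(staged)
```

Here `risks` flags the third row, whose name would not fit a 10-character column, and `clean` keeps two of the three rows.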

Page 47: IRM UK - 2009: DV Modeling And Methodology

Step 9

Begin loading the Data Vault. Load all Hubs, then all Hub Satellites, then all Links, and finish with all Link Satellites.
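The load order in Step 9 can be enforced by sorting the load jobs by table type before running them. A minimal sketch, assuming each job is a `(table_name, table_type)` pair; the type labels and table names are hypothetical.

```python
# Load order from Step 9: Hubs, Hub Satellites, Links, Link Satellites.
LOAD_ORDER = ("hub", "hub_satellite", "link", "link_satellite")


def order_loads(tables):
    """Sort (table_name, table_type) pairs into the Step 9 load order.

    The sort is stable, so tables of the same type keep their given order.
    """
    rank = {table_type: position for position, table_type in enumerate(LOAD_ORDER)}
    return sorted(tables, key=lambda table: rank[table[1]])


jobs = [
    ("sat_customer_order", "link_satellite"),
    ("hub_customer", "hub"),
    ("link_customer_order", "link"),
    ("sat_customer", "hub_satellite"),
    ("hub_order", "hub"),
]
ordered_names = [name for name, _ in order_loads(jobs)]
```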

Page 48: IRM UK - 2009: DV Modeling And Methodology

Step 10


Reconcile the Data Vault to the source system, then build a first data mart from the results. Bring business value FAST!
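Reconciliation can start with a simple comparison of the business keys seen in the source against those loaded into a Hub. The sketch below shows only that key-level check, under assumed names (`reconcile`, the result labels); a real reconciliation would likely also compare row counts and attribute values.

```python
def reconcile(source_keys, hub_keys):
    """Compare business keys seen in the source with those loaded into a Hub."""
    source, hub = set(source_keys), set(hub_keys)
    return {
        "missing_from_vault": sorted(source - hub),   # in source, never loaded
        "not_in_source": sorted(hub - source),        # loaded, but source no longer has it
        "matched": len(source & hub),
    }


# Hypothetical key lists.
report = reconcile(["C1", "C2", "C3"], ["C1", "C2", "C9"])
```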

Page 49: IRM UK - 2009: DV Modeling And Methodology

Instructor led lab


Page 50: IRM UK - 2009: DV Modeling And Methodology


10 minutes to find the Hubs….

Page 51: IRM UK - 2009: DV Modeling And Methodology


Possible Hubs From Northwind

Page 52: IRM UK - 2009: DV Modeling And Methodology


10 Minutes to find the Links…

Page 53: IRM UK - 2009: DV Modeling And Methodology


Possible Links From Northwind

Page 54: IRM UK - 2009: DV Modeling And Methodology


10 minutes to find the Satellites…

Page 55: IRM UK - 2009: DV Modeling And Methodology


Possible Satellites From Northwind

Page 56: IRM UK - 2009: DV Modeling And Methodology

What did we learn?
• We often deal with more than 1 system at a time… this was a lab with only one model.
• We didn’t have any business requirements that we might need to answer questions, but doesn’t that reflect real-life?
• The data set is extremely dirty (you never have that in your systems, right?)
• Time Zone based data can be a problem
• Lack of metadata causes integration issues and modeling decisions

Page 57: IRM UK - 2009: DV Modeling And Methodology

The Experts Say…

“The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” Bill Inmon

“The Data Vault is foundationally strong and exceptionally scalable architecture.” Stephen Brobst

“The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” Doug Laney

Page 58: IRM UK - 2009: DV Modeling And Methodology


More Notables…

“This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.” Howard Dresner

“[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from.” Scott Ambler

Page 59: IRM UK - 2009: DV Modeling And Methodology

Where To Learn More
• The Technical Modeling Book: http://LearnDataVault.com
• The Discussion Forums & events: http://LinkedIn.com (Data Vault Discussions)
• Contact me: http://DanLinstedt.com (web), [email protected] (email)
• World wide User Group (Free): http://dvusergroup.com