TRANSCRIPT
Data Vault Modeling & Methodology
Technical Side and Introduction
© Dan Linstedt, 2010, http://DanLinstedt.com
Technical Definition
The Data Vault is a detail-oriented, historical-tracking, and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise.
Architected specifically to meet the needs of today’s enterprise data warehouses.
5/28/2010 2http://empoweredHoldings.com
What Does One Look Like?
[Diagram: three Hubs (Customer, Product, Order), each with three Satellites, joined by a Link that has a Satellite of its own. The Link's Satellite records a history of the interaction. The Hubs and the Link are marked F(x).]
Elements:
• Hub = List of Unique Business Keys
• Link = List of Relationships, Associations
• Satellite = Descriptive Data
Excel As A Source…
[Diagram: User Grouping Structures (Level A, Level B, Level C, Items) become a Flattened Structure in a Staging Table, then Raw Source Data in the DV: Hub Account, Link Acct To Group, Hub Grouping, Sat Group Type, and a Hierarchical Link of Groups.]
Do you have a power executive who is technically inclined, who runs the business off a rogue spreadsheet?
CORE ARCHITECTURE
Data Vault Basic Elements
Data Vault Core Architecture
• Hubs, Links, Satellites
• Hubs = Unique List of Business Keys
• Links = Unique List of Relationships across keys
• Satellites = Descriptive Data
• Satellites have one and only one parent table
• Satellites cannot be “Parents” to other tables
• Hubs cannot be child tables
• Last Seen Dates, Load Dates, Record Sources, and Surrogate keys are not part of the core architecture. They exist to help models and key migration.
Hub Entity
A Hub is a list of unique business keys.
Hub Structure:
• Primary Key
• <Business Key>
• Load DTS
• Last Seen DTS
• Record Source
Example, Hub Product:
• Product Sequence ID
• Product Number
• Product Load DTS
• Product Last Seen DTS
• Prod Record Source
Diagram callout: Unique Index (Primary Index)
• A Hub’s business key is a unique index
• A Hub’s load date represents the FIRST TIME the EDW saw the data
• A Hub’s record source represents: first, the “Master” data source (on collisions); if that is not available, it holds the origination source of the actual key
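The Hub Product structure above can be sketched as a table definition. This is only an illustration using Python's built-in sqlite3 module; the table name, column names, column types, and the source names SYSTEM_A and SYSTEM_B are assumptions, not part of the slides. It shows the business key acting as a unique index, so a second arrival of the same key is rejected and the first load date and record source are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical names and types, modeled on the Hub Product slide.
conn.execute("""
    CREATE TABLE hub_product (
        product_seq_id INTEGER PRIMARY KEY,   -- surrogate key
        product_number TEXT NOT NULL UNIQUE,  -- business key (unique index)
        load_dts       TEXT NOT NULL,         -- first time the EDW saw the key
        last_seen_dts  TEXT,
        record_source  TEXT NOT NULL
    )
""")

INSERT = (
    "INSERT INTO hub_product (product_number, load_dts, record_source) "
    "VALUES (?, ?, ?)"
)
conn.execute(INSERT, ("P-100", "2010-05-28 02:00:00", "SYSTEM_A"))

# The same business key arriving again violates the unique index.
try:
    conn.execute(INSERT, ("P-100", "2010-05-29 02:00:00", "SYSTEM_B"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

hub_rows = conn.execute(
    "SELECT product_number, record_source FROM hub_product"
).fetchall()
```

In a real load the existence check would happen before the insert rather than relying on the constraint; the constraint here only demonstrates the uniqueness rule.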
Link Entity
A Link is an intersection of two or more business keys.
It can contain Hub keys and other Link keys.
Link Structure:
• Primary Key
• {Hub/Lnk Surrogate Keys 2..N}
• Load DTS
• Last Seen DTS
• Record Source
Example, Link Line-Item:
• Link Line Item Sequence ID
• Hub Product Sequence ID
• Hub Order Sequence ID
• **Line Item Number
• Load DTS
• Last Seen DTS
• Record Source
Diagram callout: Unique Index (Primary Index)
• A Link’s business key is a composite unique index
• A Link may or may not have a “**Item Numbering” attribute
• A Link’s load date represents the FIRST TIME the EDW saw the data
• A Link’s record source represents: first, the “Master” data source (on collisions); if that is not available, it holds the origination source of the actual key
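A comparable sketch for the Link Line-Item structure, again with sqlite3 and with hypothetical table and column names. The composite unique index covers the two Hub surrogate keys plus the optional line item number; a Link without item numbering would simply omit that column. No foreign key constraints are declared, since this deck's rules say relationships are not enforced in the data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical names and types, modeled on the Link Line-Item slide.
conn.execute("""
    CREATE TABLE link_line_item (
        line_item_seq_id INTEGER PRIMARY KEY,  -- the Link's own surrogate key
        product_seq_id   INTEGER NOT NULL,     -- Hub Product surrogate key
        order_seq_id     INTEGER NOT NULL,     -- Hub Order surrogate key
        line_item_number INTEGER NOT NULL,     -- optional item numbering
        load_dts         TEXT NOT NULL,
        last_seen_dts    TEXT,
        record_source    TEXT NOT NULL,
        UNIQUE (product_seq_id, order_seq_id, line_item_number)
    )
""")

INSERT = (
    "INSERT INTO link_line_item "
    "(product_seq_id, order_seq_id, line_item_number, load_dts, record_source) "
    "VALUES (?, ?, ?, ?, ?)"
)

def try_insert(row):
    """Return True if the relationship was new, False if it already existed."""
    try:
        conn.execute(INSERT, row)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert((1, 10, 1, "2010-05-28", "SYSTEM_A")),  # new relationship
    try_insert((1, 10, 1, "2010-05-29", "SYSTEM_A")),  # same keys again
    try_insert((1, 10, 2, "2010-05-29", "SYSTEM_A")),  # different line item
]
```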
Satellite Entity
A Satellite is a time-dimensional table housing detailed information about the Hub’s or Link’s business keys.
Satellite Structure:
• Primary Key
• Load DTS
• Extract DTS
• Detail Business Data
• {Update User}
• {Update DTS}
• Record Source
• **Load End Date
Example:
• Customer #
• Load DTS
• Extract DTS
• Customer Name
• Customer Addr1
• Customer Addr2
• {Update User}
• {Update DTS}
• Record Source
• **Load End Date
Diagram callout: Unique Index (Primary Index)
• Satellites are defined by TYPE of data and RATE OF CHANGE
• Mathematically, this reduces redundancy and decreases storage requirements over time (compared to a star schema)
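To make the time dimension concrete, here is a small sqlite3 sketch of a Satellite holding two versions of one customer's descriptive data. The table name, column names, sample values, and the choice of (parent key, load date) as the primary key are assumptions made for illustration; the optional Load End Date is used to mark the superseded row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical names; the primary key is assumed to be parent key + load date.
conn.execute("""
    CREATE TABLE sat_customer (
        customer_key   TEXT NOT NULL,
        load_dts       TEXT NOT NULL,
        extract_dts    TEXT,
        customer_name  TEXT,
        customer_addr1 TEXT,
        customer_addr2 TEXT,
        record_source  TEXT NOT NULL,
        load_end_dts   TEXT,                  -- optional; NULL marks the current row
        PRIMARY KEY (customer_key, load_dts)
    )
""")

rows = [
    ("C-1", "2010-05-01", "2010-05-01", "Acme", "1 Main St", None, "SYSTEM_A", "2010-05-28"),
    ("C-1", "2010-05-28", "2010-05-28", "Acme", "9 Oak Ave", None, "SYSTEM_A", None),
]
conn.executemany("INSERT INTO sat_customer VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# The full history stays in the table; the current row has no end date.
history_count = conn.execute(
    "SELECT COUNT(*) FROM sat_customer WHERE customer_key = ?", ("C-1",)
).fetchone()[0]
current_addr = conn.execute(
    "SELECT customer_addr1 FROM sat_customer "
    "WHERE customer_key = ? AND load_end_dts IS NULL",
    ("C-1",),
).fetchone()[0]
```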
THINKING OF BREAKING RULES…
Rules and Standards GOVERN your deployment…
Some Rules For You
• NO Foreign Keys in the Satellites!
• NO Hub to Hub (Parent Child relationships)
• NO Enforcement of relationships in the data model…
• NO Date Time attributes in HUB or LINK Primary Keys…
• Why??
– It breaks flexibility
– It breaks auditability / accountability
– It breaks Scalability
– It breaks Performance
– It introduces “Decisions” in the architecture, which breaks Patterns!
Up Next: Links and the Unit Of Work…
Business Key Definitions…
• “The contracts system is responsible for creating customer account numbers. The EDW will never see other systems creating customer account numbers.” (Requirement #101)
Sales is clearly creating customer numbers. How do we detect the issue and alert the business?
Point: Not all business keys are created EQUAL!
Link: Unit of Work
[Diagram: Hub Product, Hub Category, and Hub Supplier are joined by one Link, Product by Supplier by Category, which is the Unit Of Work. Link Line Item also appears. Two further Links, Link Prod-Cat (Link Product by Category) and Link Prod-Supp (Link Product by Supplier), each carry a Sat Effectivity. These links are optional, used for exploration only.]
What Happens When: We Break the Unit of Work
Source System UOW:
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
729         15           87
222         17           93
Model Normalization splits the source into two Links.
Link Product by Category:
Product_ID  Category_ID
222         12
222         17
729         15
Link Product by Supplier:
Product_ID  Supplier_ID
222         96
222         93
729         87
Question: After normalizing, how can you reconstruct the source image EXACTLY as it stands?
What Happens When: Trying to Rebuild from Two Links
Link Product by Category:
Product_ID  Category_ID
222         12
222         17
729         15
Link Product by Supplier:
Product_ID  Supplier_ID
222         96
222         93
729         87
Rebuilt image (compare with the Source System UOW):
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
222         17           96
222         17           93
729         15           87
Re-joining the data creates a record that does not exist in the original source system: (222, 17, 96). This is the same problem that BI engines will have when putting together Data Mart results.
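The rebuild problem can be reproduced in a few lines. The sketch below uses the rows from the slide; the variable names are illustrative only. It projects the source into the two Links, joins them back on Product_ID, and lists any rows that were never in the source.

```python
# Source System UOW rows from the slide: (Product_ID, Category_ID, Supplier_ID)
source = {(222, 12, 96), (222, 12, 93), (729, 15, 87), (222, 17, 93)}

# Breaking the unit of work: two separate two-way Links.
prod_cat = {(p, c) for p, c, _ in source}
prod_supp = {(p, s) for p, _, s in source}

# Rebuilding by joining the two Links on Product_ID.
rejoined = {
    (p, c, s)
    for p, c in prod_cat
    for p2, s in prod_supp
    if p == p2
}

# Rows produced by the join that the source system never contained.
phantom = rejoined - source
print(sorted(phantom))
```

Every source row survives the round trip, but the join cannot tell which category went with which supplier, so it returns all combinations per product.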
Link: Unit of Work Kept Together
Source System, Source Table UOW:
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
729         15           87
222         17           93
Data Vault, Link: Product by Category by Supplier:
Product_ID  Category_ID  Supplier_ID
222         12           96
222         12           93
729         15           87
222         17           93
Commutative Property: Enable reproduction of the source exactly as it stands
UOW is properly represented by a single Link in the Data Vault
CURRENT LOADING PAIN
What keeps you up at night?
Problems with EDW Loads Today
Technical Issues:
• 2am Wakeup Calls, because “data” won’t fit the business rules
• “Emergency Fixes” to Production
• Speed, Speed, Speed (shrinking load window + more data)
• Can’t load real-time data (business rules in the way!!)
• Business won’t buy better, faster hardware!
Business Issues:
• Maintenance cycles take too long
• Maintenance costs continue to increase
• Fixes to “existing mappings” break working logic
• Complexity of existing systems becomes unsustainable to business
• IT isn’t using 80%+ of the hardware resources given to them (their jobs are running at 40% utilization when they are “full-bore”)
Solutions!
Technical Solutions
• All Parallel Job Streams, as much as possible
• 1 Target Per Map, Per Action reduces complexity
• Generate Data Flows based on patterns (then focus on the real work)
• Get some SLEEP at night!! (no more production modifications)
Business Solutions
• Decrease turn-around time
• Increase Performance
• Handle Real-Time Data!!
• Reduce Complexity = Reduce Costs, Reduce Time to Implement
• Get the power back for decision making, discovering and building your own marts
How?
BASIC LOADING CONCEPTS
Some standards to follow…
Loading: A Golden Rule
It’s all about Auditability…
100% of the Data Loaded to the EDW 100% of the time!
Load Date / End Date Geology
[Diagram: Batch Load compared with Real-Time Loading]
Real Time Loading - DV Stock Trade
Incoming message: ACCOUNT=123443576 TRADE="Buy" STOCK="DAN" SHARES=100.0 CURRENCY="USD" PRICE=115.52 DATE="Feb 20, 2002" Comment="Buy Order to Execute"
• Acct Hub: 123443576
• Stock Hub: "DAN"
• Trade Link (Transactional Link): TRADE="Buy" SHARES=100.0 CURRENCY="USD" PRICE=115.52 DATE="Feb 20, 2002" Comment="Buy Order to Execute"
Transactional Link = Inserts Only, no Updates
(The diagram numbers three steps: 1, 2, 3.)
[Chart: # of Inserts (10M, 25M, 50M, 75M) against Months in Production (1 to 8), annotated “First Data Set Loaded” and “New Systems Data Added”]
• As critical mass of current business keys is reached, the insert rates decrease rapidly.
• New systems add new keys quickly and efficiently to an existing Hub.
Batch Load Date Time Stamp
[Diagram: in the Staging Area, each Stage Load fills a STAGING TABLE (Sequence_ID, …, Load_DTS, Record_Source) on its way to the EDW – Data Vault. Labels: CNTRL_DTE, LOAD_DTS.]
Load Date is exactly the same for all rows
Parallel Load Architecture - Batch
[Diagram: Sources, Stage, Hubs, Hub Satellites, Links, Link Satellites, Dimensions, Facts, grouped into Staging Loads, Data Vault Loads, and Data Mart Loads, with Major Synchronization Points between the sets]
Processing:
• All loads are done in parallel
• Sets of processes “wait” for the previous set to complete
• Processes are run as soon as data is ready
• No other “waiting” time is required
• Load dependencies are greatly reduced
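The "sets that wait for the previous set" idea can be sketched with Python's standard concurrent.futures module. The grouping of tables into sets below is an assumption based on the surrogate key lookups in the loading templates (Links look up Hub keys, Satellites look up Hub or Link keys); the table names and the `load` function are placeholders, not a real loader.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed grouping: each inner list is one set of loads that may run in parallel.
LOAD_SETS = [
    ["hub_customer", "hub_product", "hub_order"],
    ["link_line_item", "sat_customer", "sat_product"],
    ["sat_line_item"],
]

def load(table_name):
    """Placeholder for one data flow (one target, one action)."""
    return table_name

def run_in_sets(load_sets, load_fn):
    """Run every load in a set in parallel; start the next set only when the set is done."""
    completed = []
    with ThreadPoolExecutor() as pool:
        for names in load_sets:
            # list(pool.map(...)) blocks until the whole set has finished:
            # this is the synchronization point between sets.
            completed.append(sorted(pool.map(load_fn, names)))
    return completed

completed = run_in_sets(LOAD_SETS, load)
```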
Mathematics of Batch Loading
It’s all about SPEED SPEED SPEED
EDW: 1 Billion Rows and growing
10 Million Incoming Rows:
• 60% - 80% Inserts (Never Seen Before)
• 10% - 20% Updates, Matched By KEY
• 5% Deletes
• Inserts are the single fastest operation in the Database!
• Updates are the single slowest operation in the Database!
Q: Why push 80% of your Insert data through “the heaviest/slowest” transformation logic?
Simple Loading Patterns
Flow with a lookup: Source → SQ → LKP Target → Filter If Exists → Target Insert
Flow from a view: Source (Stage) → Insert View → SQ → Target Insert
Insert View: Select ALL that do not exist By PK in target
Update View: Select ALL that exist By PK in target, ONLY those with DELTA
Rule: 1 Target Per Data Flow (map/graph) Per Action
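A minimal sketch of the two views, using plain dictionaries keyed by primary key in place of database views. The function name and the sample rows are made up for illustration. The point is that each view feeds exactly one target action, so the insert rows never pass through the update logic.

```python
def split_by_action(stage_rows, target_rows):
    """Split staged rows into an insert view and an update view.

    Both arguments are dicts mapping primary key -> row values.
    """
    # Insert view: all staged rows whose PK does not exist in the target.
    insert_view = {
        pk: row for pk, row in stage_rows.items() if pk not in target_rows
    }
    # Update view: staged rows whose PK exists in the target AND whose values differ.
    update_view = {
        pk: row
        for pk, row in stage_rows.items()
        if pk in target_rows and target_rows[pk] != row
    }
    return insert_view, update_view

target = {1: ("Acme", "1 Main St"), 2: ("Bolt", "5 Elm Rd")}
stage = {
    1: ("Acme", "1 Main St"),   # unchanged: appears in neither view
    2: ("Bolt", "7 Pine Ln"),   # changed: update view
    3: ("Cord", "2 Bay Ct"),    # never seen before: insert view
}

insert_view, update_view = split_by_action(stage, target)
```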
Results of Pattern Tuning
FROM THIS…
• 5M rows @ 600 RPS = 2.31 hrs
• OR: 5M @ 7k RPS = 11.9 mins
• No parallelism
• This map must run at a minimum of 10k RPS to beat the parallel times: 5M @ 10k RPS = 8.33 mins
TO THIS!
• Pass 1:
– 5M @ 33k RPS = 2.52 mins
• Pass 2:
– 5M @ 33k RPS = 2.52 mins
– 5M @ 25k RPS = 3.33 mins
• Pass 3:
– 5M @ 50k RPS = 1.66 mins
– 5M @ 33k RPS = 2.52 mins
– 5M @ 40k RPS = 2.08 mins
– 5M @ 23k RPS = 3.62 mins
• Total Time:
– 2.52 + 3.33 + 3.62 = 9.47 mins
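The timings can be checked with a short calculation. The sketch assumes, as the total on the slide implies, that the passes run one after another while the data flows inside a pass run in parallel, so each pass takes as long as its slowest flow. Variable names are illustrative. The slide's per-flow figures are cut to two decimals, so sums of those figures can differ from the exact total in the last digit.

```python
ROWS = 5_000_000

def minutes(rows, rows_per_second):
    """Elapsed minutes to move `rows` at a steady rate."""
    return rows / rows_per_second / 60

# Rows-per-second rate of each parallel flow, grouped by pass (from the slide).
passes = [
    [33_000],
    [33_000, 25_000],
    [50_000, 33_000, 40_000, 23_000],
]

# A pass finishes when its slowest flow finishes.
pass_minutes = [max(minutes(ROWS, rps) for rps in flows) for flows in passes]
total_parallel = sum(pass_minutes)

serial_at_600 = minutes(ROWS, 600) / 60      # hours
serial_at_10k = minutes(ROWS, 10_000)        # minutes

print(f"parallel total: {total_parallel:.2f} min")
print(f"single map at 10k RPS: {serial_at_10k:.2f} min")
```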
LOADING THE DATA VAULT
Patterns Take the Cake!
Loading Templates: Hubs
• Select a “Master” system, and a hierarchy of importance for sub-systems to annotate arrival location of data
• Purpose of the loading template: find out if the business key exists in the Hub; if not, insert it
• Use a distinct (unique) list of business keys coming from the staging area
Flow: Staging Data → Distinct List of BK Keys → Exists In Target? → Yes: Drop Row From Feed / No: Insert Into Target (Gen Surrogate) → Hub
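The Hub template can be sketched in plain Python, with a dictionary standing in for the Hub table. The function name, the row layout, and the use of itertools.count as a surrogate key generator are assumptions for illustration only.

```python
from itertools import count

def load_hub(hub, staged_keys, load_dts, record_source, next_id):
    """Insert business keys the Hub has not seen; return how many were inserted.

    hub: dict mapping business key -> row (a stand-in for the Hub table).
    """
    inserted = 0
    for business_key in sorted(set(staged_keys)):   # distinct list of business keys
        if business_key in hub:
            continue                                # exists in target: drop row from feed
        hub[business_key] = {
            "seq_id": next(next_id),                # generate surrogate
            "load_dts": load_dts,                   # first time the EDW saw the key
            "record_source": record_source,
        }
        inserted += 1
    return inserted

surrogates = count(1)
hub_product = {}
first = load_hub(hub_product, ["P-1", "P-2", "P-1"], "2010-05-28", "SYSTEM_A", surrogates)
second = load_hub(hub_product, ["P-2", "P-3"], "2010-05-29", "SYSTEM_B", surrogates)
```

Running the same feed twice inserts nothing the second time, which is what makes the pattern restartable.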
Loading Templates: Links
• Select a “Master” system, and a hierarchy of importance for sub-systems to annotate arrival location of data
• Purpose of the loading template: find all relationships between business keys, then check whether the relationship is already recorded in the Link; if not, insert it
• Use a distinct list of related business keys
Flow: Staging Data → Distinct List of Busn Keys → Lookup EACH Hub’s Surrogate Keys → Exists In Target? → Yes: Drop Row From Feed / No: Insert Into Target (gen surrogate) → Link
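A matching sketch for the Link template, again with dictionaries standing in for tables. It assumes the Hubs were loaded first, so every business key in the feed already has a surrogate key; the function name, the row layout, and the sample keys are hypothetical.

```python
def load_link(link, hub_lookups, staged_rows, load_dts, record_source):
    """Insert relationships the Link has not recorded; return how many were inserted.

    link: dict mapping a tuple of Hub surrogate keys -> row.
    hub_lookups: one dict per Hub, mapping business key -> surrogate key.
    staged_rows: tuples of business keys, one position per Hub.
    """
    inserted = 0
    for business_keys in sorted(set(staged_rows)):  # distinct list of related keys
        # Look up each Hub's surrogate key (assumes the Hubs are already loaded).
        surrogate_keys = tuple(
            lookup[bk] for lookup, bk in zip(hub_lookups, business_keys)
        )
        if surrogate_keys in link:
            continue                                # exists in target: drop row from feed
        link[surrogate_keys] = {
            "seq_id": len(link) + 1,                # generate surrogate
            "load_dts": load_dts,
            "record_source": record_source,
        }
        inserted += 1
    return inserted

hub_product = {"P-1": 1, "P-2": 2}
hub_order = {"O-9": 10}
link_line_item = {}
staged = [("P-1", "O-9"), ("P-2", "O-9"), ("P-1", "O-9")]
first = load_link(link_line_item, [hub_product, hub_order], staged, "2010-05-28", "SYSTEM_A")
second = load_link(link_line_item, [hub_product, hub_order], staged, "2010-05-29", "SYSTEM_A")
```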
Loading Templates: Satellites
• Select a “Master” system, and a hierarchy of importance for sub-systems to annotate arrival location of data
• Purpose of the loading template: gather descriptive data and compare it to the most recent copy of the information in the Satellite; if there are any deltas, load; if not, don’t load
• Use a distinct list of descriptive fields from the source systems
Flow: Staging Data → Distinct List of Sat Rows → Lookup EACH Hub’s or Link’s Surrogate Keys → Find Latest Sat Row → All Columns Match? → Yes: Drop Row From Feed / No: Insert Into Target → Satellite
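The Satellite template differs from the other two in that it compares values rather than keys. In the sketch below a dictionary of per-key history lists stands in for the Satellite table; the function name and data layout are assumptions for illustration, and the surrogate key lookup is taken as already done.

```python
def load_satellite(sat, staged, load_dts):
    """Append a new Satellite row only when the descriptive data changed.

    sat: dict mapping parent surrogate key -> list of (load_dts, attributes), oldest first.
    staged: dict mapping parent surrogate key -> attributes (a dict of column values).
    Returns the number of rows inserted.
    """
    inserted = 0
    for parent_key, attributes in staged.items():
        history = sat.setdefault(parent_key, [])
        # Find the latest Satellite row and compare all columns.
        if history and history[-1][1] == attributes:
            continue                                # all columns match: drop row from feed
        history.append((load_dts, dict(attributes)))
        inserted += 1
    return inserted

sat_customer = {}
day1 = load_satellite(sat_customer, {1: {"name": "Acme", "addr1": "1 Main St"}}, "2010-05-28")
day2 = load_satellite(sat_customer, {1: {"name": "Acme", "addr1": "1 Main St"}}, "2010-05-29")
day3 = load_satellite(sat_customer, {1: {"name": "Acme", "addr1": "9 Oak Ave"}}, "2010-05-30")
```

Only the first load and the address change produce rows; the unchanged feed on the second day inserts nothing.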
GETTING STARTED… HOW TO
How to build your Data Vault…
Step 1: Establish Scope (Build Business Case Model)
Step 1: Define Business Keys
[Diagram: Hub Campaign, Hub Customer, Hub Invoice, Hub Products]
Step 2: Define Associations
[Diagram: Hub Campaign, Hub Customer, Hub Invoice, and Hub Products, now joined by Link Campaign by Invoice by Customer, Link Invoice Line Items, and Link Product on Campaign]
Step 3: Define Descriptive Data
[Diagram: the same Hubs (Campaign, Customer, Invoice, Products) and Links (Campaign by Invoice by Customer, Invoice Line Items, Product on Campaign), now with Satellites attached: Sat Effectiveness Ratings, Sat Effectiveness Dates, Sat Availability Dates, Sat Defect Reasons, Sat Stock Quantities, Sat Descriptions, Sat Dates and Amounts, Sat Amounts, Sat Quantities, Sat Address, Sat Contacts, Sat Details]
Step 4: Build Source Model (PK/FK)
(No Pictures, Sorry)
• Ensure the source model (DDL Only) has Primary and Foreign Keys defined
• Normalize the source model (if not normalized)
• Capture and integrate all source systems involved (if not already captured)
• Add Comments to the DDL (tables and fields)
Step 5: Build Cross-Reference
SOURCE TABLE      SOURCE COLUMN     GROUP  TARGET TABLE          TARGET COLUMN
AHLTAT_DIAGNOSIS  DOC_REF           1      SAT_AHLTAT_DIAGNOSIS  DOC_REF
AHLTAT_DIAGNOSIS  DATAID            1      HUB_DIAGNOSIS         DIAGNOSIS_DATAID
AHLTAT_DIAGNOSIS  FACILITYNCID      1      HUB_FACILITY          FAC_ID
AHLTAT_DIAGNOSIS  DIAGNOSISNCID     1      SAT_AHLTAT_DIAGNOSIS  DIAGNOSISNCID
AHLTAT_DIAGNOSIS  ENCOUNTERNUMBER   1      HUB_EVENT             EVNT_ID
AHLTAT_DIAGNOSIS  CLINICIANNCID     1      HUB_CLINICIAN         CLINICIAN_NCID
AHLTAT_DIAGNOSIS  UNIT_NUMBER       1      HUB_UNIT              UNIT_ID
AHLTAT_DIAGNOSIS  MEDCINID          1      HUB_MEDCIN            MEDCIN_ID
AHLTAT_DIAGNOSIS  CREATETIME        1      SAT_AHLTAT_DIAGNOSIS  CREATETIME
AHLTAT_DIAGNOSIS  CREATEUSERNCID    1      SAT_AHLTAT_DIAGNOSIS  CREATEUSERNCID
AHLTAT_DIAGNOSIS  MODIFYUSERNCID    1      SAT_AHLTAT_DIAGNOSIS  MODIFYUSERNCID
AHLTAT_DIAGNOSIS  MODIFYTIME        1      SAT_AHLTAT_DIAGNOSIS  MODIFYTIME
AHLTAT_DIAGNOSIS  PRIORITY          1      SAT_AHLTAT_DIAGNOSIS  PRIORITY
AHLTAT_DIAGNOSIS  DIAGNOSESCOMMENT  1      SAT_AHLTAT_DIAGNOSIS  DIAGNOSESCOMMENT
The purpose of such an exercise is not to identify all the elements, but specifically to identify the target Hubs (i.e., the business keys), the target Links, and at LEAST a single Satellite for at least 1 source column…
The engine (SaaS) will automatically assign all other descriptive elements to the first Satellite identified.
Step 6: Generate Baseline ETL/ELT
[Diagram: Source DDL, Target DDL, and the Cross-Ref Mapping (XLS) feed a step labeled “Generate Code, Reports, Documentation”, which produces Data Flows (Mappings / Graphs)]
CONCLUSIONS / SUMMARY
What did we learn?
Data Vault…
Modeling Is…
• Made up of Hubs, Links, and Satellites
• Easy to create and build
• Hardest thing is to “find/locate” and define the Business Keys
• Consistent, Scalable, Repeatable, Pattern Based
• RULES BASED / STANDARDS DRIVEN
Loading Is…
• Scalable, Fault-Tolerant, Parallelizable, Pattern Based
• Generatable
• Performance Based
• 100% Restartable
• Set Based
• Devoid of “Soft” Business Rules!!
Still - Lots To Learn…
We didn’t cover:
• Joins
• point-in-time tables
• building marts
• business logic components
• SQL extraction
• bridge tables
• what to do when…
• dealing with bad data
• architecting security, managing governance, handling metadata
Contact me for Workshops (training), and Mentoring…
Questions?
Dan Linstedt
President, Empowered Holdings, LLC
http://EmpoweredHoldings.com
http://DanLinstedt.com
Tel: +1 802-524-8566
E-Mail: [email protected]
SERVICES:
• Consulting
• Assessments
• Product Selection Scorecards
• Architecture / Design
• Mentoring and Workshops (training)