TDWI Agile Data Warehouse - Data Vault, What Is the Buzz About

DESCRIPTION

This is the presentation I gave at TDWI EU in Munich on June 22nd, 2012. It is about a robust, agile and reliable way of deploying data warehouse environments. The majority of data warehouses in the Netherlands is now Data Vault based, which has instigated a wave of innovation among engineers and software vendors pursuing model-driven development based on pattern-based ETL, standardized modeling and a particular architectural style.

TRANSCRIPT

R.D.Damhof

Data Vault, What is the buzz about

TDWI München June 18, 2012 Ronald Damhof

Agile Data Warehousing

“Our highest priority is to satisfy the customer through early and continuous delivery of valuable software”
Agile Manifesto, 2001

Kent Beck, Mike Beedle, Arie van Bennekum, Alistair Cockburn, Ward Cunningham, Martin Fowler, James Grenning, Jim Highsmith, Andrew Hunt, Ron Jeffries, Jon Kern,

Brian Marick, Robert C. Martin, Steve Mellor, Ken Schwaber, Jeff Sutherland, Dave Thomas

(Diagram: two sources separated by a ‘semantic gap’ from information products such as ‘calculating risk’, ‘yield modules’ and ‘customer segmentation’.)

Everybody mines their own data
Everybody enriches their own data
Everybody uses their own data
User = developer, with his self-made tools
Data quality determined by the individual
It’s a grind – limited reusability
Lead times unpredictable
No management

Let’s ‘order’ an information product and hire a master/expert
Separation between user and developer
The developer/expert mines the data
The information product = custom made
Data quality is mostly dependent on the developer/expert
Lead times unpredictable
Still not much reusability

A central department that knows what information you need
That assembles information products, ready to be used by you
‘I know what you want’ – a black box
Efficiency is the name of the game
At least I got something, but it does not comply – even remotely – with my needs
Even worse: the guild days are still there – the expert is now submerged, but needed to get the data you actually need. Introduction of management – you want something? Please apply in triplicate…

Stephen Denning (2011) – Radical Management

Creating information products the moment they are asked for
Against quality criteria in line with the expectation of the customer
Empower the customer with skills and facilities to be more self-sufficient
Minimize ‘data’ stock as much as possible
Embrace new wishes and changes required by the customer
The customer is the most important part of the production process

A modern data management environment: The ‘Supermarket’

The ‘Restaurant’

The ‘Do it yourself buffet’

Push characteristics
§ Mass production
§ Known specifications, operational definitions, standards
§ Repeatable, predictable, & even better: uniform process
§ Part of the system that needs statistical control
§ Inventory allowed/necessary
§ Supply driven
§ Reliability over flexibility

Pull characteristics
§ Just in time
§ Demand driven
§ Build to order
§ Preferably no inventory
§ Flexibility over reliability

Back to the issue at hand…

§ What: the ‘production process of data’
§ Where: coordination – local versus central
§ How: system engineering – systematic vs. opportunistic
§ Which leading principles guide us

Local vs Central deployment

(Diagram: data sources, internal & external, feed a generic, central process; data & function services deliver, via the information delivery process, to the recipient and the local end user.)

Information Delivery Process:
1. Get the raw uncut data
2. Register & Standardize
3. Enrich and cleanse data
4. Generate Information products

IT Development
– Development line discipline (OTAP)
– Developers at a distance from users
– Heavy separation of function
– Sustainability (systematic approach)

Delegated Development
– Mutually dependent / within frameworks
– Lightweight development process
– Self-sufficient / limited freedom

Selfservice Development
– Minimum of specialisation / distinction of roles
– Ad-hoc development process
– Developer = user
– Self-sufficient / great degree of freedom
– Very broad tasks
– Manoeuvrability (opportunistic approach)

System Engineering - Systematic vs. Opportunistic

Leading principles

Adaptable
Sustainable
Decoupled
Centralized
Compliant
Standardized & Industrialized
Effective

(Diagram: Company xxx data management domain – sources and a source store feed the Enterprise Data Warehouse and a Business View, which deliver to BI apps for analysis, reports and ad-hoc use; axes: Data ‘What’, Function ‘How’, ‘Where’/‘Whom’; stages 1-4.)

R.D.Damhof 16  

Sustainable  

Compliant  

Decoupled  

Standardized  

Centralized  

Adaptable  

Effec.ve  

Source  to  product  

Sourcestore  to  product  

Sourcestore  to  BV   EDW  (DV)  

(Diagram: Company xxx data warehouse & Business Intelligence domain – sources and a source store feed the Enterprise Data Warehouse, which delivers a Business View and data feeds to BI apps for analysis, reports and ad-hoc use; axes: Data ‘What’, Function ‘How’, ‘Where’/‘Whom’; stages 1-4.)

R.D.Damhof 18  

Administra.ve  process  

Data  &  Informa.on  recipients  

Informa.on  Delivery  Process  

AXain  

Register  &  Standardize  

Enrich  

Generate&  Distribute  

Proces  

Decision-­‐  &  control  

Systems  (internal  &  external)  

DV  based  Data    

Warehouse  

Informa.on  products  

Data  products  

Compliance  repor.ng  

Supply  chain  op.miza.on  

Staging  

Risk  Management  

Performance  Management  

Fraud  detec.on  

Market  basket  analysis  

Business  rules  

PDCA  

Control  /  Metadata  

Pull  

Push  

Push  

Why DV?

Metamodel driven automation
– Models (process, rules and data) determine the metadata; the metadata determines the automation artifacts
– The aim is to be 100% declarative
– Not everything can be generated; specific, tailored metadata will remain necessary

Metadata driven automation
– Inputs: source model(s), target model, template design, naming conventions
– Advanced inputs: normalization preferences, ontologies
– Taken from Dan Linstedt’s blog post: http://danlinstedt.com/datavaultcat/code-generation-for-data-vault-not-as-easy-as-you-think/

Template driven automation
– In its most basic form: documentation describing a pattern
– More advanced: generating XML code for 2nd generation ETL tooling
– E.g. http://www.grundsatzlich-it.nl/bi-tools-templator.html

Data Vault implementations
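As an illustration of the template-driven flavour, here is a minimal Python sketch that renders hub DDL and a hub-load statement from one metadata entry. All names (the metadata keys, customer_hub, customer_nr, source_customer) are hypothetical; real generators like the ones referenced above also handle links, satellites, naming conventions and much more.

```python
# Minimal sketch of template-driven Data Vault automation.
# The templates and metadata keys below are illustrative assumptions.

HUB_DDL = """CREATE TABLE {hub} (
    id INTEGER PRIMARY KEY,
    {bk} TEXT NOT NULL UNIQUE,
    load_dts TEXT NOT NULL,
    record_src TEXT NOT NULL
);"""

HUB_LOAD = """INSERT INTO {hub} ({bk}, load_dts, record_src)
SELECT src.{bk}, :load_dts, :record_src
FROM {source} AS src
WHERE NOT EXISTS (SELECT 1 FROM {hub} AS hub WHERE hub.{bk} = src.{bk});"""

def generate_hub_artifacts(meta):
    """Render hub DDL and load SQL from a single metadata entry."""
    return HUB_DDL.format(**meta), HUB_LOAD.format(**meta)

# One metadata entry: hub name, business key column, source table.
meta = {"hub": "customer_hub", "bk": "customer_nr", "source": "source_customer"}
ddl, load = generate_hub_artifacts(meta)
print(ddl)
print(load)
```

Adding a second hub is then a metadata change, not new hand-written ETL, which is the whole point of the pattern.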

My PoV about (Data Vault) automation Tooling

§ Generation is an aid, not a goal in itself. Do not accommodate the principles to fit the tool… look for decoupling

§ Truly understand the mechanics – handcraft it first! Invest in proper education and learning, invest in ‘getting ready’ time, and involve your ‘customers’ from the start

§ PoC, PoC, PoC

§ Deliver, Deliver, Deliver

Agility & Data Vault (1)

Why is it that you can build and deploy extremely small particles in Data Vault and not in other approaches, without having an increase in the overhead and coordination of these particles? In other words; 'Divide and Conquer to beat the Size / Complexity Dynamic’

Why is it that you can re-engineer your existing model and guarantee that the changes remain local? Something that is hugely beneficial in data warehouses that - by definition - grow over time.

Agility & Data Vault (2)

Why is it that - as your (Data Vault based) data warehouse grows - your costs initially grow ‘merely’ linearly, and that as you approach the end state the marginal growth in cost decreases exponentially?

Agility & Data Vault (3)

Data Vault as such is not agile; it is the development process that needs to be agile. DV merely supports the agile development process.

“Our highest priority is to satisfy the customer through early and continuous delivery of valuable software”

Agile Manifesto, 2001 Kent Beck, Mike Beedle, Arie van Bennekum, Alistair Cockburn, Ward Cunningham,

Martin Fowler, James Grenning, Jim Highsmith, Andrew Hunt, Ron Jeffries, Jon Kern, Brian Marick, Robert C. Martin, Steve Mellor, Ken Schwaber, Jeff Sutherland, Dave Thomas

Data Model Time Line Historic Overview

© (Linstedt, Graziano, & Hultgren, The New Business Supermodel, The Business of Data Vault Modeling, 2008, p. 36)

§  Created By Dan Linstedt §  Released in 2000 §  Formally Introduced in the Netherlands in 2007

§  First DV Book: The Business of Data Vault Modeling 2008 §  First (Dutch) User group in 2010 §  Technical book from Dan Linstedt in 2011

Application Architecture

Top Down Approach

Bottom Up Approach

Irony

Hybrid Approach (Data Vault)

Data Vault ETL/Load Architecture
– 100% of the data (within scope), 100% of the time
– Source driven / auditable: ‘fact oriented’
– Template/metadata driven
– No business rules

Kimball or Inmon ETL
– Complex ETL
– Truth oriented
– Business rules before the EDW

Pictures: Genesee Academy ©

Classic Data Vault Application Architecture

(Diagram: business transaction systems deliver datasets to staging; a structure transformation loads the Data Vault, where hubs = business keys; business rule execution – structure and value transformation – runs through a Rule Vault holding generic business rules into Staging Out. Scored against the leading principles: Adaptable, Sustainable, Compliant, Decoupled, Effectiveness, Standardized, Centralized.)

Data Vault Application Architecture

§ Central EDW
§ Business rules downstream
§ Incremental / non-destructive loading
§ 100% of the data (within scope), 100% of the time
§ Auditable / partly source driven

Modeling

Data Vault Constructs

Core Components

Data Vault Core Components

Hubs

Satellites

Links

Loading

HUB load

INSERT INTO customer_hub (customer#, load_dts, record_src)
SELECT source.customer#, @load_dts, @record_src
FROM source_customer AS source
WHERE NOT EXISTS
  (SELECT * FROM customer_hub AS hub
   WHERE hub.customer# = source.customer#)
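The hub-load pattern can be exercised end to end with Python’s sqlite3; a minimal sketch with an assumed, simplified schema (customer_nr stands in for customer#, which is not a legal identifier in most dialects, and the load date is hard-coded):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_customer (customer_nr TEXT);
    CREATE TABLE customer_hub (
        customer_nr TEXT PRIMARY KEY,  -- the business key
        load_dts    TEXT,
        record_src  TEXT
    );
    INSERT INTO source_customer VALUES ('C1'), ('C2');
""")

# Same pattern as the slide: insert only business keys not yet in the hub.
HUB_LOAD = """
    INSERT INTO customer_hub (customer_nr, load_dts, record_src)
    SELECT src.customer_nr, :load_dts, :record_src
    FROM source_customer AS src
    WHERE NOT EXISTS (
        SELECT 1 FROM customer_hub AS hub
        WHERE hub.customer_nr = src.customer_nr
    );
"""

params = {"load_dts": "2012-06-18", "record_src": "CRM"}
conn.execute(HUB_LOAD, params)
conn.execute(HUB_LOAD, params)  # second run inserts nothing: the load is idempotent

print(conn.execute("SELECT COUNT(*) FROM customer_hub").fetchone()[0])  # -> 2
```

The NOT EXISTS guard is what makes the template safe to re-run, which in turn is what makes the loads generatable.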

Loading a Link

Link Load

INSERT INTO custcontact_link (cust_id, contact_id, load_dts, record_src)
SELECT cust.id, contact.id, @load_dts, @record_src
FROM source_table AS source
INNER JOIN contact_hub AS contact
  ON contact.contact# = source.contact#
INNER JOIN customer_hub AS cust
  ON cust.customer# = source.customer#
WHERE NOT EXISTS
  (SELECT * FROM custcontact_link AS link
   WHERE link.contact_id = contact.id AND link.cust_id = cust.id)
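The link load follows the same idempotent pattern, with joins resolving the source’s business keys to the hubs’ surrogate keys first. A runnable sqlite3 sketch, again with assumed, simplified identifiers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_hub (id INTEGER PRIMARY KEY, customer_nr TEXT UNIQUE);
    CREATE TABLE contact_hub  (id INTEGER PRIMARY KEY, contact_nr TEXT UNIQUE);
    CREATE TABLE custcontact_link (
        cust_id INTEGER, contact_id INTEGER,
        load_dts TEXT, record_src TEXT
    );
    CREATE TABLE source_table (customer_nr TEXT, contact_nr TEXT);
    INSERT INTO customer_hub (customer_nr) VALUES ('C1');
    INSERT INTO contact_hub  (contact_nr)  VALUES ('P9');
    INSERT INTO source_table VALUES ('C1', 'P9');
""")

LINK_LOAD = """
    INSERT INTO custcontact_link (cust_id, contact_id, load_dts, record_src)
    SELECT cust.id, contact.id, :load_dts, :record_src
    FROM source_table AS src
    JOIN contact_hub  AS contact ON contact.contact_nr = src.contact_nr
    JOIN customer_hub AS cust    ON cust.customer_nr   = src.customer_nr
    WHERE NOT EXISTS (
        SELECT 1 FROM custcontact_link AS link
        WHERE link.contact_id = contact.id AND link.cust_id = cust.id
    );
"""

params = {"load_dts": "2012-06-18", "record_src": "CRM"}
conn.execute(LINK_LOAD, params)
conn.execute(LINK_LOAD, params)  # idempotent, like the hub load

print(conn.execute("SELECT COUNT(*) FROM custcontact_link").fetchone()[0])  # -> 1
```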

Loading a Satellite

Satellite Load

INSERT INTO customer_sat (hub_id, load_dts, name, record_src)
SELECT hub.id, @load_dts, source.cust_name, @record_src
FROM source_customer AS source
INNER JOIN customer_hub AS hub
  ON hub.customer# = source.customer#
INNER JOIN customer_sat AS sat
  ON sat.hub_id = hub.id AND sat "is most recent"
WHERE sat.name <> source.cust_name
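The "is most recent" clause is pseudocode on the slide. One runnable interpretation with sqlite3 (assumed, simplified identifiers) resolves it with a correlated MAX(load_dts) subquery, and uses a LEFT JOIN so that the very first load of a key, which the INNER JOIN form would miss, also inserts a row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_hub (id INTEGER PRIMARY KEY, customer_nr TEXT UNIQUE);
    CREATE TABLE customer_sat (
        hub_id INTEGER, load_dts TEXT, name TEXT, record_src TEXT
    );
    CREATE TABLE source_customer (customer_nr TEXT, cust_name TEXT);
    INSERT INTO customer_hub (customer_nr) VALUES ('C1');
    INSERT INTO source_customer VALUES ('C1', 'Alice');
""")

# Insert a new satellite row when no row exists yet for the hub key,
# or when the most recent row carries a different attribute value.
SAT_LOAD = """
    INSERT INTO customer_sat (hub_id, load_dts, name, record_src)
    SELECT hub.id, :load_dts, src.cust_name, :record_src
    FROM source_customer AS src
    JOIN customer_hub AS hub ON hub.customer_nr = src.customer_nr
    LEFT JOIN customer_sat AS sat
        ON sat.hub_id = hub.id
       AND sat.load_dts = (SELECT MAX(load_dts) FROM customer_sat
                           WHERE hub_id = hub.id)   -- "is most recent"
    WHERE sat.hub_id IS NULL OR sat.name <> src.cust_name;
"""

conn.execute(SAT_LOAD, {"load_dts": "2012-06-18", "record_src": "CRM"})
conn.execute(SAT_LOAD, {"load_dts": "2012-06-19", "record_src": "CRM"})  # unchanged: no row
conn.execute("UPDATE source_customer SET cust_name = 'Alice B.'")
conn.execute(SAT_LOAD, {"load_dts": "2012-06-20", "record_src": "CRM"})  # changed: new version

print(conn.execute("SELECT COUNT(*) FROM customer_sat").fetchone()[0])  # -> 2
```

History is built up purely by inserts, never updates, which is the non-destructive loading the architecture slides call for.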

Data Vault Loading Paradigm

Top 10 Rules for Data Vault Modeling

Why is it that you can build and deploy extremely small particles in Data Vault and not in other approaches, without having an increase in the overhead and coordination of these particles? In other words; 'Divide and Conquer to beat the Size / Complexity Dynamic’

Why is it that you can re-engineer your existing model and guarantee that the changes remain local? Something that is hugely beneficial in data warehouses that - by definition - grow over time.

Why is it that - as your (Data Vault based) data warehouse grows - your costs initially grow ‘merely’ linearly, and that as you approach the end state the marginal growth in cost decreases exponentially?

Agility & Data Vault - recap (1)

➡  Mass production

➡  Known specifications, operational definitions, standards

➡  Repeatable, predictable, & even better; uniform process

➡  Part of the system that needs statistical control

➡  Inventory allowed/necessary

➡  Mainly supply driven

➡  Reliability over flexibility

Remember the Push characteristics? Data Vault matches every one of them.
Automation of a Data Vault ‘production process’ is just common sense

Agility & Data Vault - recap (2)

Bonus Slides: Forks and mutations in DV ‘evolution’

Type 1 - Classic Data Vault

(Diagram: business transaction systems feed datasets into staging; structure transformation into the Data Vault, hub = business keys; business rule execution – structure and value transformation – via a Rule Vault with generic business rules into Staging Out. Scored against the leading principles: Adaptable, Sustainable, Compliant, Decoupled, Effectiveness, Standardized, Centralized.)

Type 2 - Source Data Vault

(Diagram: business transaction systems feed a Staging Vault – persistent staging in DV format, structure transformation only, no integration, hub = surrogate keys – which feeds a Business Data Vault – business rule execution and integration, DV modelled – and, via structure transformation, the data marts. Scores on Sustainable, Compliant, Decoupled, Standardized and Centralized; question marks on Adaptable and Effectiveness.)

(Diagram: sources, with a 100% semantic gap, feed per-source Staging DVs – still the source – which feed a Business DV: integration, cleansing, consolidation, business rule execution upstream?? DV modelled.)

(Diagram: the same picture, with the collection of Staging DVs labelled a ‘Source Data Warehouse’.)

Wanna know more?
§ Training & certification: www.geneseeacademy.com
§ Books: ‘Super Charge Your Data Warehouse: Invaluable Data Modeling Rules to Implement Your Data Vault’ – D. Linstedt / K. Graziano
§ LinkedIn: Data Vault Discussions (approx. 800 members)
§ Niche non-commercial conferences: www.dwhautomation.com
§ Many blogs, articles and presentations on the World Wide Web
§ The best way to learn: try it, make some code, experience, engage

Drs. Ronald D. Damhof

Blog: http://prudenza.typepad.com/ and http://www.b-eye-network.com/blogs/damhof/
LinkedIn: http://nl.linkedin.com/in/ronalddamhof
Email: ronald.damhof@prudenza.nl
Twitter: RonaldDamhof
Skype: Ronald.Damhof
Mobile: +31 (0)6 269 67 184
Others: Information Quality Certified Professional (IQCP), Data Vault Certified Grand Master, Certified Scrum Master, Member of the Boulder BI Brain Trust (#BBBT)

Ronald Damhof is an independent practitioner in the field of data management and decision support. He graduated in 1995 in Economics. Since 1995 he has worked as a practitioner in the field of Information Management with a focus on decision support and data management, trying hard to enhance the rigor and relevance in these fields by combining scientific research with the everyday challenges of the practitioner. Ronald is mainly hired by customers in the role of business/IT architect, auditor, coach & trainer. He blogs on B-Eye-Network.com as well as his own blog, is a member of the prestigious BBBT, wrote several articles regarding decision support architectures and is a researcher in the field of Information Management.

Although Ronald likes to work with theoretically grounded research and proven practices, Ronald is not a ‘white paper’ architect; put your money where your mouth is, is his motto. He likes to see architectures ‘live’ in enterprises, not just write about them. In most organizations his role often extends beyond architecture. In truly agile spirit, the roles he plays depend on the context of the client: he can be a missionary (selling the value), a project manager (getting it done), a scrum master (removing impediments), a specialist (educating hardware peeps, data architects, data logistics etc.) or a leader.

Thank You
