DWH Data Integration
Christian Stade-Schuldt, Project-A Ventures
BI Team Knowledge Transfer
Outline
Motivation
Import
Data Quality
Performance
Monitoring
Project-A, DWH Data Integration, 2014 2
What is data integration?
- combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information
- encompasses discovery, cleansing, monitoring, transforming and delivery of data from a variety of sources
- by far the largest portion of building a data warehouse
The ETL Process
Extract data from homogeneous or heterogeneous data sources
Transform the data into a proper format or structure for querying and analysis
Load it into the final target
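The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline described in the talk: the CSV columns (`order_id`, `amount`), the table name `orders`, and the in-memory SQLite target are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real import would read a file or query a database.
RAW_CSV = "order_id,amount\n1, 19.90\n2,5.00 \n"


def extract(text):
    """Extract: parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: strip stray whitespace and convert to typed tuples (id, amount in cents)."""
    return [
        (int(row["order_id"]), round(float(row["amount"].strip()) * 100))
        for row in rows
    ]


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
total_cents = conn.execute("SELECT SUM(amount_cents) FROM orders").fetchone()[0]
```

Keeping each step a separate function makes it possible to test the transformation without touching either the source or the target.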
Processes and Jobs
- Process → Set of jobs in a particular order
- Different processes for separation
- can run at different time intervals
- File-dependency management
- Visualize graph
Processes and Jobs
- Job → Set of commands, depends on other jobs
- Command → Specific action (e.g. run SQL file)
- ⇒ developer friendly (plain text files)
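One way to picture the process/job/command hierarchy is as a dependency graph that is sorted before execution. The sketch below is an assumption about how such a model could look, not the tool used in the talk; the `Job` class, the job names and the SQL file names are hypothetical, and commands are only listed, not executed.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # available from Python 3.9


@dataclass
class Job:
    name: str
    commands: list  # e.g. names of SQL files to run, in order
    depends_on: set = field(default_factory=set)


def run_order(jobs):
    """Return job names in an order that respects all dependencies."""
    graph = {job.name: job.depends_on for job in jobs}
    return list(TopologicalSorter(graph).static_order())


# A process is a set of jobs; the order is derived from the dependencies.
process = [
    Job("load_orders", ["load_orders.sql"]),
    Job("load_customers", ["load_customers.sql"]),
    Job("transform_orders", ["transform_orders.sql"], {"load_orders", "load_customers"}),
    Job("constrain_orders", ["constrain_orders.sql"], {"transform_orders"}),
]

order = run_order(process)
```

`TopologicalSorter` raises `CycleError` if jobs depend on each other in a loop, which catches a misconfigured process before anything runs.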
Sources
- Comma-separated files
- JSON files
- various databases (MySQL, PostgreSQL, Microsoft SQL Server)
- via project code
- external APIs (usually export to csv via cronjob)
The Schema Life-Cycle
- Data warehouse can be rebuilt from scratch with every import
- Import runs on a next schema
- Switch schemata in the last step
- Failure does not impact current data warehouse
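The switch in the last step can be expressed as a handful of schema renames inside one transaction. The statements below assume a PostgreSQL-style database (`ALTER SCHEMA ... RENAME TO`, transactional DDL) and the schema names `dim` and `dim_next`; the helper `switch_schema_sql` and the `_old` suffix are hypothetical. The sketch only builds the SQL and does not connect to a database.

```python
def switch_schema_sql(current="dim", next_suffix="_next", old_suffix="_old"):
    """Build the statements that promote the freshly imported schema.

    Assumes the current schema already exists (i.e. this is not the first import).
    """
    next_schema = current + next_suffix
    old_schema = current + old_suffix
    return [
        "BEGIN;",
        # Remove the backup left over from the previous switch, if any.
        f"DROP SCHEMA IF EXISTS {old_schema} CASCADE;",
        # Keep the current data around as a backup.
        f"ALTER SCHEMA {current} RENAME TO {old_schema};",
        # Promote the newly built schema.
        f"ALTER SCHEMA {next_schema} RENAME TO {current};",
        "COMMIT;",
    ]


statements = switch_schema_sql()
```

Because the import only ever writes to the next schema, a failed run simply never reaches these statements and the current schema stays untouched.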
Data Quality
- Real-world data is dirty
- Data quality is critical to data warehouse and business intelligence solutions
- Goal:
  - single point of truth
  - cleaned-up and validated data
  - easily accessible for users
Data Quality 2
- Referential integrity → requires every value of one attribute (column) of a relation (table) to exist as a value of another attribute in a different (or the same) relation (table)
- Check constraints (ADD CHECK)
- Unique constraints
- Consistency checks → What goes in has to come out. No one's left behind (some are. :( )
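The constraint types above can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the tables `country` and `transfer`, their columns, and the helper `violates` are invented for the example, and SQLite stands in for whatever database the warehouse actually runs on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE country (
        country_id INTEGER PRIMARY KEY,
        iso_code   TEXT NOT NULL UNIQUE                                      -- unique constraint
    );
    CREATE TABLE transfer (
        transfer_id       INTEGER PRIMARY KEY,
        sender_country_id INTEGER NOT NULL REFERENCES country (country_id),  -- referential integrity
        amount            NUMERIC NOT NULL CHECK (amount > 0)                -- check constraint
    );
    INSERT INTO country VALUES (1, 'DE'), (2, 'FR');
""")


def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


bad_reference = violates("INSERT INTO transfer VALUES (1, 99, 10)")  # country 99 does not exist
bad_amount = violates("INSERT INTO transfer VALUES (2, 1, -5)")      # fails the CHECK
duplicate = violates("INSERT INTO country VALUES (3, 'DE')")         # fails UNIQUE
accepted = not violates("INSERT INTO transfer VALUES (3, 2, 10)")    # valid row

# Consistency check: every source row must arrive in the target.
source_rows = [(10, 1, 5), (11, 2, 7)]
with conn:
    conn.executemany("INSERT INTO transfer VALUES (?, ?, ?)", source_rows)
loaded = conn.execute(
    "SELECT COUNT(*) FROM transfer WHERE transfer_id IN (10, 11)"
).fetchone()[0]
rows_match = loaded == len(source_rows)
```

Comparing row counts between source and target is the simplest form of a consistency check; comparing sums of key measures is a common next step.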
Improving performance
- Cost-based scheduling for jobs (Priority Queue)
- Incremental loads
- Parallel jobs
- Compute keys (e.g. date, corridor_id → (1000*sender_country_id + receiver_country_id))
- Index relevant columns
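Two of these points lend themselves to a short sketch. The first function applies the `corridor_id` formula from the slide; the range check and the inverse `split_corridor_id` are additions that assume country ids stay below 1000. The second function is one possible reading of cost-based scheduling, assumed here to mean "among the jobs whose dependencies are finished, start the most expensive one first"; the job names and cost figures are hypothetical.

```python
import heapq


def corridor_id(sender_country_id, receiver_country_id):
    """Combine two ids into one integer key, as in the slide's formula."""
    if not 0 <= receiver_country_id < 1000:
        raise ValueError("receiver_country_id must be in [0, 1000) for a unique key")
    return 1000 * sender_country_id + receiver_country_id


def split_corridor_id(key):
    """Recover (sender_country_id, receiver_country_id) from a computed key."""
    return divmod(key, 1000)


def order_by_cost(costs, deps):
    """Order jobs so that, among all ready jobs, the costliest runs first.

    costs: job name -> estimated runtime; deps: job name -> set of prerequisite jobs.
    """
    remaining = {job: set(deps.get(job, ())) for job in costs}
    # heapq is a min-heap, so costs are negated to pop the largest first.
    ready = [(-costs[job], job) for job, open_deps in remaining.items() if not open_deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, job = heapq.heappop(ready)
        order.append(job)
        del remaining[job]
        for other, open_deps in remaining.items():
            if job in open_deps:
                open_deps.remove(job)
                if not open_deps:
                    heapq.heappush(ready, (-costs[other], other))
    if remaining:
        raise ValueError("dependency cycle or unknown prerequisite")
    return order


costs = {"load_orders": 30, "load_customers": 5, "transform_orders": 20, "flatten_orders": 10}
deps = {
    "transform_orders": {"load_orders", "load_customers"},
    "flatten_orders": {"transform_orders"},
}
order = order_by_cost(costs, deps)
```

A single integer key such as `corridor_id` can be joined and indexed as one column instead of a pair of columns.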
Monitoring
Runtime stats: How long does each job/process run
Timeline graph: How parallel is a process
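Runtime stats of this kind can be collected with a small timing wrapper around each job. The sketch below is an assumption about how that might be done, not the monitoring shown in the talk; the `timed` helper, the `runtimes` dictionary and the job names are hypothetical, and the job bodies are stand-ins.

```python
import time
from contextlib import contextmanager

runtimes = {}  # job name -> elapsed seconds


@contextmanager
def timed(job_name):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        runtimes[job_name] = time.perf_counter() - start


with timed("load_orders"):
    time.sleep(0.01)  # stand-in for real work

with timed("transform_orders"):
    sum(range(1000))  # stand-in for real work

# Jobs sorted from slowest to fastest, e.g. for a report.
slowest_first = sorted(runtimes, key=runtimes.get, reverse=True)
```

Storing the start time alongside the duration would also give the data needed for a timeline graph of parallel jobs.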
Monitoring 2
DB schema: Visualize schema
Relation sizes: Visualize growth over time
Monitoring 3
Index usage: Are indexes used or unnecessary?
Naming conventions
- prefix schemata (e.g. os_, om_)
- schema names (e.g. dim_next, dim, tmp, data)
Naming conventions 2
Jobs follow a pattern:
load: loads data into the data schema
transform: transforms data into the dim schema
copy: copies data into the dim schema (no transformation)
flatten: creates flattened tables for faster access
constrain: applies foreign key constraints
Summary
- Data integration is the largest portion of building a data warehouse
- Ensure data quality by applying constraints and tests
- Monitor your data integration process
For Further Reading I
Ralph Kimball. The Data Warehouse Toolkit. Wiley, 2013.