Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Andreas Buckenhofer Data Warehouse (Datenbanken II)

Upload: andreas-buckenhofer

Post on 07-Jan-2017


Page 1: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Andreas Buckenhofer

Data Warehouse (Datenbanken II)


Daimler TSS GmbH

Overview of the lecture

Data Warehouse / DHBW / Fall 2016 / Page 2

1. Introduction to DWH, DWH Architectures - 20.10.2016

2. Data Modeling, OLAP 1 - 27.10.2016

3. OLAP 2, ETL - 03.11.2016

4. Metadata, DWH Projects, Advanced Topics - 10.11.2016

What you will learn today

• After the end of this lecture you will be able to

• Understand advanced topics of OLAP

• Understand data integration

• ETL

• Data quality

OLAP

How to cover data changes?

• Data changes, e.g.

• new employees

• employees change departments

• employees leave

• whole department reorganisations, etc

• How are the changes handled?

• What does the business want to see? (Reporting Scenarios)

• How is data inserted / updated in dimensions? (Slowly Changing Dimensions)

Reporting scenarios

• As-is scenario

• As-of scenario

• As-posted scenario

• As-posted with comparable data scenario

Data Mart – example baseline

Employee dimension 2015:
Employee  Organisation
Miller    DWH
Rogers    DWH
Douglas   Database
Powell    Database

Employee dimension 2016:
Employee  Organisation
Miller    DWH
Rogers    DWH
Powell    DWH
Douglas   Database
Bush      Database

Facts:
Employee  Year  #Projects
Miller    2015  10
Rogers    2015  10
Douglas   2015  10
Powell    2015  10
Miller    2016  10
Rogers    2016  10
Powell    2016  10
Douglas   2016  10
Bush      2016  10

As-is scenario

Reporting uses current structure (Employee dimension 2016):
Employee  Organisation
Miller    DWH
Rogers    DWH
Powell    DWH
Douglas   Database
Bush      Database

Facts: same fact table as in the example baseline (10 projects per employee and year).

Result:
Organisation  #Projects '15  #Projects '16
DWH           30             30
Database      10             20

As-of scenario

Reporting uses structure as demanded (here: Employee dimension 2015):
Employee  Organisation
Miller    DWH
Rogers    DWH
Douglas   Database
Powell    Database

Facts: same fact table as in the example baseline (10 projects per employee and year).

Result:
Organisation  #Projects '15  #Projects '16
DWH           20             20
Database      20             20

As-posted scenario

Reporting uses the "historical truth": each year is reported with the structure that was valid in that year.

Facts: same fact table as in the example baseline (10 projects per employee and year).

Result:
Organisation  #Projects '15  #Projects '16
DWH           20             30
Database      20             20

As-posted with comparable data scenario

Reporting uses the "historical truth", restricted to comparable data: only employees assigned to the same organisation in both years are counted (Miller, Rogers, Douglas).

Facts: same fact table as in the example baseline (10 projects per employee and year).

Result:
Organisation  #Projects '15  #Projects '16
DWH           20             20
Database      10             10
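The four reporting scenarios can be reproduced with a few lines of Python. This is only a sketch built on the example data of the slides; the names `DIM`, `FACTS` and `report` are made up for illustration and are not part of the lecture.

```python
from collections import defaultdict

# Employee -> organisation for each dimension version (example baseline).
DIM = {
    2015: {"Miller": "DWH", "Rogers": "DWH",
           "Douglas": "Database", "Powell": "Database"},
    2016: {"Miller": "DWH", "Rogers": "DWH", "Powell": "DWH",
           "Douglas": "Database", "Bush": "Database"},
}
# Facts: (employee, year, #projects); every employee has 10 projects per year.
FACTS = [(emp, year, 10) for year, emps in DIM.items() for emp in emps]


def report(structure_for_year, keep=lambda emp: True):
    """Sum projects per (organisation, year).

    structure_for_year maps the year of a fact to the dimension version
    that is used to look up the organisation of the employee.
    """
    totals = defaultdict(int)
    for emp, year, projects in FACTS:
        org = DIM[structure_for_year(year)].get(emp)
        if org is not None and keep(emp):
            totals[(org, year)] += projects
    return dict(totals)


as_is = report(lambda year: 2016)        # always the current structure
as_of_2015 = report(lambda year: 2015)   # structure as demanded (2015)
as_posted = report(lambda year: year)    # structure valid in the fact year

# Comparable data: employees with the same organisation in both versions.
unchanged = {e for e, org in DIM[2015].items() if DIM[2016].get(e) == org}
as_posted_comparable = report(lambda year: year,
                              keep=lambda emp: emp in unchanged)

print(as_is)
print(as_posted_comparable)
```

Only the lookup of the dimension version differs between the scenarios; the fact rows are never changed.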

Slowly changing dimensions

• Dimensions must absorb changes

• Slowly changing dimensions according to Kimball / Ross (2002):

• SCD Type 0

• no changes, new data is ignored

• SCD Type 1

• See next slides

• SCD Type 2

• See next slides

• SCD Type 3

• See next slides

• And some more SCD types

• Rarely relevant

Slowly changing dimensions – example baseline

Employee dimension:
ID  Employee  Organisation
1   Miller    DWH
2   Powell    Database

• Changes:
  • New data added: Albert, DWH
  • Powell marries and has new name Parker

Slowly Changing Dimension Type 1

• No History
• Dimension attributes always contain current data

Employee dimension before the changes:
ID  Employee  Organisation
1   Miller    DWH
2   Powell    Database

• Changes:
  • New data added: Albert, DWH
  • Powell marries and has new name Parker

Employee dimension after the changes:
ID  Employee  Organisation
1   Miller    DWH
2   Parker    Database
3   Albert    DWH
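A minimal Python sketch of a Type 1 load, assuming the dimension is held as a dictionary keyed by ID. The function name `apply_scd1` and the row layout are illustrative assumptions, not taken from the lecture.

```python
def apply_scd1(dimension, source_rows):
    """SCD Type 1: overwrite existing rows, add new rows, keep no history."""
    for row in source_rows:
        dimension[row["id"]] = {
            "employee": row["employee"],
            "organisation": row["organisation"],
        }
    return dimension


dim = {
    1: {"employee": "Miller", "organisation": "DWH"},
    2: {"employee": "Powell", "organisation": "Database"},
}
changes = [
    {"id": 2, "employee": "Parker", "organisation": "Database"},  # name change
    {"id": 3, "employee": "Albert", "organisation": "DWH"},       # new employee
]
apply_scd1(dim, changes)
for key in sorted(dim):
    print(key, dim[key]["employee"], dim[key]["organisation"])
```

After the load the old name Powell cannot be recovered from the dimension any more.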

Slowly Changing Dimension Type 3

• Historization of latest change only
• And storage of current value

Employee dimension before the changes:
ID  Employee  Organisation
1   Miller    DWH
2   Powell    Database

• Changes:
  • New data added: Albert, DWH
  • Powell marries and has new name Parker

Employee dimension after the changes:
ID  Employee Name  Previous Name  Organisation  Previous Organisation
1   Miller         NULL           DWH           NULL
2   Parker         Powell         Database      NULL
3   Albert         NULL           DWH           NULL
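A Type 3 load can be sketched as follows. The attribute names (`previous_name`, `previous_organisation`) and the function `apply_scd3` are assumptions chosen to mirror the columns of the slide.

```python
TRACKED = ("name", "organisation")


def apply_scd3(dimension, row):
    """SCD Type 3: keep the current value plus the value before the latest change."""
    current = dimension.get(row["id"])
    if current is None:
        # New member: no previous values yet.
        dimension[row["id"]] = {
            "name": row["name"], "previous_name": None,
            "organisation": row["organisation"], "previous_organisation": None,
        }
        return
    for attr in TRACKED:
        if current[attr] != row[attr]:
            current["previous_" + attr] = current[attr]  # older history is lost
            current[attr] = row[attr]


dim = {}
apply_scd3(dim, {"id": 1, "name": "Miller", "organisation": "DWH"})
apply_scd3(dim, {"id": 2, "name": "Powell", "organisation": "Database"})
# Changes from the slide:
apply_scd3(dim, {"id": 3, "name": "Albert", "organisation": "DWH"})
apply_scd3(dim, {"id": 2, "name": "Parker", "organisation": "Database"})
print(dim[2])
```

A second name change for ID 2 would overwrite `previous_name`, so only the latest change stays visible.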

Slowly Changing Dimension Type 2

• Full Historization
• Dimension contains timestamps

Employee dimension before the changes:
ID  Employee  Organisation
1   Miller    DWH
2   Powell    Database

• Changes:
  • New data added: Albert, DWH
  • Powell marries and has new name Parker

Employee dimension after the changes:
ID  Employee  Organisation  Valid From  Valid To
1   Miller    DWH           01.01.2015  NULL (or 31.12.9999)
2   Powell    Database      21.12.2014  15.10.2016
3   Albert    DWH           05.03.2014  NULL (or 31.12.9999)
2   Parker    Database      15.10.2016  NULL (or 31.12.9999)
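The Type 2 logic (close the current version, append a new one) can be sketched in Python. The helper names `apply_scd2` and `version_at` are hypothetical, and the sketch assumes that Valid From is inclusive and Valid To is exclusive, which the slide does not state explicitly.

```python
from datetime import date


def apply_scd2(rows, emp_id, employee, organisation, change_date):
    """SCD Type 2: close the open version (if it differs) and append a new one."""
    for row in rows:
        if row["id"] == emp_id and row["valid_to"] is None:
            if (row["employee"], row["organisation"]) == (employee, organisation):
                return rows  # nothing changed, keep the open version
            row["valid_to"] = change_date  # close the old version
    rows.append({"id": emp_id, "employee": employee, "organisation": organisation,
                 "valid_from": change_date, "valid_to": None})
    return rows


def version_at(rows, emp_id, day):
    """Return the version valid on `day` (valid_from inclusive, valid_to exclusive)."""
    for row in rows:
        if (row["id"] == emp_id and row["valid_from"] <= day
                and (row["valid_to"] is None or day < row["valid_to"])):
            return row
    return None


dim = [
    {"id": 1, "employee": "Miller", "organisation": "DWH",
     "valid_from": date(2015, 1, 1), "valid_to": None},
    {"id": 2, "employee": "Powell", "organisation": "Database",
     "valid_from": date(2014, 12, 21), "valid_to": None},
]
apply_scd2(dim, 2, "Parker", "Database", date(2016, 10, 15))
print(version_at(dim, 2, date(2016, 1, 1))["employee"])
print(version_at(dim, 2, date(2016, 12, 1))["employee"])
```

Facts loaded in early 2016 still resolve to Powell, later facts resolve to Parker, which is what the as-posted scenario needs.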

Dimension types: Conformed dimension

• Dimension that is used in several fact tables
• Fact tables can be connected by using conformed dimensions

[Figure: Sales Fact and Inventory Fact both reference the Product Dimension and the Location Dimension]

Dimension types: Conformed dimension

• Kimball: Enterprise DWH Bus Matrix is a “design tool” to document the organization’s processes

Date Product Location Customer Promotion

Sales Fact X X X X X

Inventory Fact X X X

Customer Returns Fact X X X X

Sales Forecast Fact X X X

Dimension types: Junk dimension

• Collection of random codes, each of which could also form its own dimension

ID MaritalStatus Gender

1 Single Male

2 Single Female

3 Married Male

4 Married Female

Dimension types: Role-playing dimension

• A single dimension is referenced several times by a fact table

• E.g. several dates in fact table reference Date Dimension

ID OrderDate DeliveryDate ProductionDate

1 .. .. ..

2 .. .. ..

3 .. .. ..

4 .. .. ..

Dimension types: Degenerate dimension

• A dimension without its own dimension table. The data is stored in the fact table only.

• Used e.g. for drill-through in reports

• E.g. OrderNumber in sales fact table

ID OrderNumber

1 A51273 .. ..

2 72841 .. ..

3 732GT5 .. ..

4 624TR5K .. ..

Temporal data storage (Bitemporal data)

[Figure: timeline with the dates 10.09., 20.09., 30.09. and 10.10. The price is 15 EUR at first and 16 EUR later. The new price of 16 EUR is entered into the DB on 10.09. (transaction time) and applies from 20.09. (valid time).]

Temporal data storage (Bitemporal data)

• Valid time is the time period during which a fact is true in the real world.

• Transaction time is the time period during which a fact stored in the database was known.

• Bitemporal data combines both Valid and Transaction Time.

• Source: (Wikipedia, https://en.wikipedia.org/wiki/Temporal_database)

Temporal data storage (Bitemporal data)

• SQL standard SQL:2011

• But different implementations by RDBMSes like Oracle, DB2, SQL Server and others

• Different syntax!

• Different coverage of standard!

• Very useful for slowly changing dimensions type 2, but also for other purposes

DB2 Valid Time example

CREATE TABLE customer_address
( customerID  INTEGER NOT NULL
, name        VARCHAR(100)
, city        VARCHAR(100)
, valid_start DATE NOT NULL
, valid_end   DATE NOT NULL
, PERIOD BUSINESS_TIME(valid_start, valid_end)
, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS)
);

DB2 Valid Time example

INSERT INTO customer_address VALUES
  (1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');

UPDATE customer_address
  FOR PORTION OF BUSINESS_TIME
  FROM '22.05.2013' TO '31.12.2013'
  SET city = 'San Diego'
WHERE customerID = 1;

Result:
customerID  Name    City       Valid_start  Valid_end
1           Miller  Seattle    01.01.2013   22.05.2013
1           Miller  San Diego  22.05.2013   31.12.2013

DB2 Valid Time example

SELECT *
FROM customer_address
FOR BUSINESS_TIME AS OF '17.05.2013';
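What the valid-time statements above do to the rows can be imitated in plain Python. This is a sketch, not DB2 behaviour in every detail: the functions `update_for_portion` and `as_of` are invented for illustration, and periods are assumed to be half-open (start inclusive, end exclusive).

```python
from datetime import date


def update_for_portion(rows, customer_id, start, end, **changes):
    """Imitate UPDATE ... FOR PORTION OF BUSINESS_TIME FROM start TO end.

    Rows that overlap [start, end) are split; only the overlapping part
    receives the new attribute values.
    """
    result = []
    for row in rows:
        overlaps = (row["customerID"] == customer_id
                    and row["valid_start"] < end and row["valid_end"] > start)
        if not overlaps:
            result.append(row)
            continue
        if row["valid_start"] < start:            # part before the portion
            result.append({**row, "valid_end": start})
        result.append({**row, **changes,          # the updated portion
                       "valid_start": max(row["valid_start"], start),
                       "valid_end": min(row["valid_end"], end)})
        if row["valid_end"] > end:                # part after the portion
            result.append({**row, "valid_start": end})
    return result


def as_of(rows, day):
    """Imitate FOR BUSINESS_TIME AS OF day."""
    return [r for r in rows if r["valid_start"] <= day < r["valid_end"]]


rows = [{"customerID": 1, "name": "Miller", "city": "Seattle",
         "valid_start": date(2013, 1, 1), "valid_end": date(2013, 12, 31)}]
rows = update_for_portion(rows, 1, date(2013, 5, 22), date(2013, 12, 31),
                          city="San Diego")
for r in rows:
    print(r["city"], r["valid_start"], r["valid_end"])
print(as_of(rows, date(2013, 5, 17))[0]["city"])
```

The AS OF query for 17.05.2013 falls into the first period, so it returns the Seattle row.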

DB2 Transaction Time example

CREATE TABLE customer_info(
  customerId INTEGER NOT NULL,
  comment    VARCHAR(1000) NOT NULL,
  sys_start  TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
  sys_end    TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
  PERIOD SYSTEM_TIME (sys_start, sys_end)
);

DB2 Transaction Time example

Transaction on 15.10.2013:

INSERT INTO customer_info VALUES(
  1, 'comment 1'
);

Transaction on 31.10.2013:

UPDATE customer_info SET comment = 'comment 2'
WHERE customerId = 1;

Result:
CustomerId  comment    Sys_start   Sys_end
1           comment 2  31.10.2013  31.12.2999

DB2 Transaction Time example

SELECT *
FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';

Data comes from a history table:

CustomerId  comment    Sys_start   Sys_end
1           comment 1  15.10.2013  31.10.2013

Valid Time and Transaction Time can be combined = Bitemporal table

Hierarchies – non-normalized hierarchy table

Hierarchies

• Hierarchy data in one or more tables
  • the dimension table(s)
• MOLAP: Aggregated values for each hierarchy level stored
• ROLAP: Aggregated values dynamically calculated
  • i.e. through SQL built-in aggregation functions
• Storage of aggregated data in ROLAP:
  • Fact data record with aggregated data
  • Materialized view/query table
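The ROLAP case, where aggregates per hierarchy level are calculated at query time with SQL aggregation functions, can be tried out with SQLite from Python. The tables `dim_store` and `fact_sales`, their columns and the helper `rollup` are invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE fact_sales (store_id INTEGER, amount INTEGER);
INSERT INTO dim_store VALUES (1, 'Stuttgart', 'Germany'),
                             (2, 'Ulm', 'Germany'),
                             (3, 'Paris', 'France');
INSERT INTO fact_sales VALUES (1, 10), (1, 5), (2, 7), (3, 4);
""")


def rollup(level):
    """Aggregate the sales amount for one level of the store hierarchy."""
    if level not in ("city", "country"):  # only known column names are allowed
        raise ValueError(level)
    sql = (f"SELECT d.{level}, SUM(f.amount) "
           f"FROM fact_sales f JOIN dim_store d USING (store_id) "
           f"GROUP BY d.{level} ORDER BY d.{level}")
    return conn.execute(sql).fetchall()


print(rollup("city"))
print(rollup("country"))
```

A materialized view (or an aggregated fact row) would store the result of such a query instead of recomputing it for every report.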

Aggregations - Types of facts/measures

• Additive

• Can be summed up through all of the dimensions in the fact table.

• Example: Retail data warehouse:

• Dimensions: time, location, customer and product

• Measure: sales amount

• Semi-Additive

• Can be summed up for some of the dimensions in the fact table only.

• Example: Banking data warehouse

• Dimensions: time, account

• Measure: current balance

Aggregations - Types of facts/measures

• Non-Additive:

• Cannot be summed up for any of the dimensions present in the fact table

• Example: Retail data warehouse:

• Dimensions: time, location, customer and product

• Measure: ratios

• Can be computed from additive or semi-additive facts
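A small numeric sketch of the difference between semi-additive and non-additive measures. All numbers and variable names are made up for illustration.

```python
from collections import defaultdict

# Semi-additive: account balances, one row per (day, account).
balances = [
    ("2016-10-01", "A", 100.0), ("2016-10-01", "B", 50.0),
    ("2016-10-02", "A", 120.0), ("2016-10-02", "B", 50.0),
]
per_day = defaultdict(float)
for day, _account, balance in balances:
    per_day[day] += balance          # summing across accounts is meaningful

closing_balance = per_day[max(per_day)]   # across time: take the last value
meaningless_total = sum(per_day.values()) # summing across time has no meaning

# Non-additive: a ratio (margin = profit / revenue) per sale.
sales = [(200.0, 50.0), (100.0, 10.0)]    # (revenue, profit)
margin_of_sums = sum(p for _, p in sales) / sum(r for r, _ in sales)
sum_of_margins = sum(p / r for r, p in sales)

print(closing_balance, meaningless_total)
print(margin_of_sums, sum_of_margins)
```

The overall margin has to be recomputed from the additive measures revenue and profit; adding up the individual margins gives a different and wrong number.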

Aggregation function

• Has to be defined for each measure and dimension
• Sum is the most frequently used
• Other possible aggregation functions
  • count
  • average

Types of fact tables

• Transactional

• Most common

• Usually one row per line/event in a transaction

• Most detailed level

• The grain must (should) be the same for all rows

• E.g. sales data

• Periodic snapshots

• Picture of the data at a specific point in time

• Often computed from transactional fact table, e.g. aggregated by month

• The grain must (should) be the same for all rows

• E.g. inventory data (summed up for each day)

Types of fact tables

• Accumulating snapshots
  • Show the activity of a process/event over time
  • The data is not complete at the beginning and is updated as soon as new data arrives (e.g. delivery date can be unknown at the beginning)
  • The grain must (should) be the same for all rows
  • E.g. processing an order

MDX - OLAP Query Language

• ROLAP = SQL is standard language

• MOLAP = MDX - Multidimensional Expressions

• De-facto industry standard developed by Microsoft

• Very complex

• SQL like syntax

• Language elements

• Scalar – data type "string" or "number"

• Dimension

• Hierarchy

• Level

• Member

• …

MDX Sample Query

SELECT
  { [Measures].[Store Sales] } ON COLUMNS,
  { [Date].[2002], [Date].[2003] } ON ROWS
FROM Sales
WHERE ( [Store].[USA].[CA] )

• This query defines the following result set information:
  • The SELECT clause sets the query axes as the Store Sales (amount) member and the 2002 and 2003 members of the Date dimension.
  • The FROM clause indicates that the data source is the Sales cube.
  • The WHERE clause defines the "slicer axis" as the California member of the Store dimension.

Result:
      Store Sales
2002  95863,66
2003  99764,01

OLAP Engines

• Middleware between
  • Reporting/BI frontend tool and
  • (relational or multidimensional) data store
• Provide a logical multidimensional view on OLAP cubes independently of their storage scheme
• Hold OLAP metadata (dimensions, hierarchies, measures, ...)
• Usually support MDX through corresponding application programming interfaces
  • ODBO - OLE DB for OLAP
  • XMLA – XML for Analysis
• E.g. IBM Cognos TM1, Oracle Essbase, Microsoft Analysis Services, Oracle OLAP Option, IBM Cognos Powerplay

Sample OLAP Engine: Cubing Services in IBM InfoSphere Warehouse 9.5.x

Cube Server in Action – Startup

Cube Server in Action – Query Processing

ROLAP: Cognos Report Studio example

Exercise

• We designed two data models in the last session
  • Data Vault
  • Star Schema
• The customer in the business department has additional requirements:
  • An engine must be added to both data models. An engine has a unique identification number
  • Cars can have several engine types at the same time nowadays: "classic" engine + electric engine. Assume that it is sufficient to reference 2 engines at the same time
  • Cars can have several engines over time, e.g. an engine is replaced because of a defect
• Enlarge both data models with the new requirements
• What happens to existing data if the model is enlarged? Is a migration necessary?


Exercise Data Vault, possible solution

Exercise Star Schema, possible solution

Exercise: What happens to existing data if the model is enlarged? Is a migration necessary?

• Data Vault:
  • Additional tables only, no reload of data or other changes necessary
• Star Schema:
  • Migration necessary
  • Changes in the fact table (new fields / foreign keys)
  • Existing data must be changed (updated)
    • usually reloaded, as updates on large amounts of data are slow
  • Existing code to load data must be changed

ETL – Extract, Transform, Load

Standard Data Warehouse Architecture

[Figure: internal and external data sources, and the Data Warehouse with backend and frontend. The layers are Staging Layer (Input Layer), Core Warehouse Layer (Storage Layer) and Reporting Layer (Output Layer, Mart Layer). The diagram contains three "?" markers.]

ETL Process

• Monitor changes in source systems

• Other term: Data integration

Tasks of the ETL Process

• Extract

• capture and copy data from source systems (e.g. operational systems)

• many different types of sources

• Relational, hierarchical DBMSs

• Flat files

• Other internal/external sources

• Transform

• Filter data

• Integrate data

• Check and cleanse data

• Load

• Fast load into staging area or another layer

ETL vs ELT

Extract – Transform – Load

• ETL often used for data integration in general (for ETL and ELT)
• But:
  • ELT is differentiated from ETL

[Figure: data flow for ETL (Source DB, ETL Server, Target DB) and for ELT (Source DB, ELT Server, Target DB)]

ETL vs ELT

ETL: Data is transferred to the ETL server and transferred back to the DB. High network bandwidth required.
ELT: Data remains in the DB except for cross-database loads (e.g. source to target).

ETL: Transformations are performed in the ETL server.
ELT: Transformations are performed (in the source or) in the target.

ETL: Proprietary code is executed in the ETL server.
ELT: Generated code, e.g. SQL, PL/SQL, SQLT.

ETL: Typically used for
  • source to target transfer
  • compute-intensive transformations
  • small amounts of data
ELT: Typically used for
  • high amounts of data

ETL Tool vs manual ETL

ETL Tool: Informatica, Talend, Oracle ODI, etc.
Manual ETL: SQL, PL/SQL, SQLT, etc.

ETL Tool: Separate license
Manual ETL: No additional license

ETL Tool: Workflow, error handling, and restart/recovery functionality included
Manual ETL: Workflow, error handling, and restart/recovery functionality must be implemented manually

ETL Tool: Impact analysis and where-used (lineage) functionality available
Manual ETL: Impact analysis and where-used (lineage) functionality difficult

ETL Tool: Faster development, easier maintenance
Manual ETL: Slower development, more difficult maintenance

ETL Tool: Additional (tool) know-how required
Manual ETL: Know-how often available

ETL Server

• Extract services: Connectors, Sorter
• Load services: Connector, Sorter, Bulk Loader
• Data Profiling services: Source analysis
• Data Quality services: Data cleansing
• Data Transformation and Integration services: Data mapping, Business rules, Slowly Changing Dimensions, Datatype conversion, Lookups
• Operations management services: Scheduler, Control, Repository Management, Job Monitoring, Auditing, Error Handling, Security

Mapping - Informatica

[Screenshot: Informatica mapping with Source, Filter, Lookup and Target]

Mapping with Transformations - Informatica

[Screenshot: Informatica mapping with Sorter, Aggregator Transformation and Union Transformation]

Workflow - Informatica

[Screenshot: Informatica workflow with three sessions, each containing a mapping, and a decision & coordination step]

Job Monitoring - Informatica

Monitoring (Data change detection)

• Extracts from source systems
  • Initial extract for setting up the data warehouse
    • Initial Load
  • Periodical extracts for adding new/changed information to the data warehouse
    • Incremental Load
• Question: How to determine what is new or what has changed in the source systems?
  → Task of "monitoring"

Monitoring: net effect of changes

• Discovery of all changes vs. determining the net effect at extract/load time only

• Example: an attribute value can be changed in two ways:

• by one update operation

• by one delete and one insert operation

• The net effect of both is the same

• However, history information is lost if the net effect is recorded only
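The net-effect idea can be made concrete with a short Python sketch. The functions `replay` and `net_effect` and the tiny change logs are assumptions for illustration only.

```python
def replay(state, log):
    """Apply a list of (operation, key, value) changes to a copy of the state."""
    state = dict(state)
    for op, key, value in log:
        if op == "delete":
            state.pop(key)
        else:  # insert or update
            state[key] = value
    return state


def net_effect(before, after):
    """Compare the state at two extract times; intermediate steps are invisible."""
    changes = {}
    for key in before.keys() | after.keys():
        if key not in before:
            changes[key] = "insert"
        elif key not in after:
            changes[key] = "delete"
        elif before[key] != after[key]:
            changes[key] = "update"
    return changes


start = {1: "Seattle"}
log_update = [("update", 1, "San Diego")]
log_delete_insert = [("delete", 1, None), ("insert", 1, "San Diego")]

print(net_effect(start, replay(start, log_update)))
print(net_effect(start, replay(start, log_delete_insert)))
```

Both logs lead to the same net effect, so the fact that the second variant deleted and re-inserted the row cannot be recovered from the two states alone.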

Exercise

• Which techniques can be used to identify changes in a source system (RDBMS)?
  • E.g. in an OLTP system
    • new products are inserted
    • customer address changes
    • product is deleted because it is out of stock
• How would you identify such changes? List advantages / disadvantages of possible solutions
• Think about making changes in the source system. Think also about other solutions without any change in the source system.

Monitoring techniques

• Depend on characteristics of the data sources

• The following techniques are based on modern relational DBMS

• Types of techniques

• Based on DBMS

• Trigger-based

• Log-based discovery

• Replication techniques

• Controlled by application

• Timestamp-based discovery

• Snapshot-based discovery

Trigger-based

• Active monitoring mechanisms
• Based on (database) triggers
• Example:
  • If a new record is inserted into the sales transaction table, then insert transaction id and timestamp into a change table
• Advantage:
  • Triggers do not change operational applications
• Disadvantage:
  • Performance impact on operational systems if triggers are used extensively
  • Triggers have to be implemented for every table in the source systems

Trigger-based

• Sample Trigger Code, Oracle

CREATE [OR REPLACE] TRIGGER <trigger_name>
  {BEFORE|AFTER} {INSERT|DELETE|UPDATE}
  ON <table_name>
  [REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]
  [FOR EACH ROW [WHEN (<trigger_condition>)]]
<trigger_body>

• Trigger is created for each source table in the OLTP DB and stores insert/update/delete changes in a “log/journal table”
• Trigger body contains insert statements into the log/journal table
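The journal-table idea can be tried out without an Oracle installation. The following sketch uses SQLite triggers from Python instead of Oracle syntax; the tables `sales` and `sales_journal` and the trigger names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL);

-- Journal table filled by the triggers: one row per change.
CREATE TABLE sales_journal (
    sales_id   INTEGER,
    operation  TEXT,                              -- 'I', 'U' or 'D'
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER sales_ai AFTER INSERT ON sales FOR EACH ROW
BEGIN
    INSERT INTO sales_journal (sales_id, operation) VALUES (NEW.id, 'I');
END;

CREATE TRIGGER sales_au AFTER UPDATE ON sales FOR EACH ROW
BEGIN
    INSERT INTO sales_journal (sales_id, operation) VALUES (NEW.id, 'U');
END;

CREATE TRIGGER sales_ad AFTER DELETE ON sales FOR EACH ROW
BEGIN
    INSERT INTO sales_journal (sales_id, operation) VALUES (OLD.id, 'D');
END;
""")

# The operational application only touches the sales table.
conn.execute("INSERT INTO sales VALUES (1, 9.99)")
conn.execute("UPDATE sales SET amount = 19.99 WHERE id = 1")
conn.execute("DELETE FROM sales WHERE id = 1")

journal = conn.execute(
    "SELECT sales_id, operation FROM sales_journal ORDER BY rowid").fetchall()
print(journal)
```

The extract job reads `sales_journal` instead of scanning `sales`, and deletes stay visible although the row itself is gone.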

Log-based

• Log-based discovery
  • Also known as CDC (Change Data Capture)
  • Usage of database logs to determine changes
  • DBMSs write transaction logs in order to be able to undo partially executed transactions
  • This information can be used to determine all changes
  • Log reader identifies insert, update, delete, truncates and writes the changes as inserts into the staging layer
  • Transaction log files can be transferred to other systems to avoid additional load on source systems

Log-based (sample product architecture IIDR)

[Figure: components of the sample architecture: source OLTP DB with logs, IIDR ReplEngine on the source datastore, IIDR ReplEngine on the DWH datastore, DWH DB (Staging Layer, Core Layer, Mart Layer), frontend with standard reports and ad hoc reports]

Replication-based

• Replication techniques
  • Data replication
  • Target tables not necessarily on local system
  • Typically uses transaction logs
  • Log reader identifies insert, update, delete, truncates and writes the changes into replicated tables (insert remains insert, update remains update, etc.)
  • Useful for 1:1 copies, but it is still a challenge to detect changes for loading the data mart

Replication-based (sample product architecture IIDR)

[Figure: components of the sample architecture: source OLTP DB with logs, IIDR ReplEngine on the source datastore, IIDR ReplEngine on the mirror datastore, mirror DB, DWH DB (Staging Layer, Core Layer, Mart Layer), frontend with standard reports and ad hoc reports]

Timestamp-based

• Timestamp-based discovery
  • Every data item in a table is associated with timestamp information about its validity period
  • Changed data can be determined from this timestamp information
  • Operational applications have to keep a limited change history

Timestamp-based

• Sample customer table in OLTP
• Each table gets a change timestamp
• Delta process reads latest data only (e.g. ChangeTimestamp >= <yesterday>)
• Problem: it is not possible to identify deleted rows

CustomerID  Name    Department  Change Timestamp
1           Miller  DWH         15.01.2015 17:00:01
2           Powell  DB          22.03.2016 08:30:22
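A sketch of such a delta process in Python, using the two sample rows. The function `delta` and the chosen extract time are assumptions; the second call illustrates why deleted rows stay undetected.

```python
from datetime import datetime


def delta(rows, last_extract):
    """Return the rows changed since the last extract (ChangeTimestamp >= last_extract)."""
    return [r for r in rows if r["change_ts"] >= last_extract]


customers = [
    {"id": 1, "name": "Miller", "department": "DWH",
     "change_ts": datetime(2015, 1, 15, 17, 0, 1)},
    {"id": 2, "name": "Powell", "department": "DB",
     "change_ts": datetime(2016, 3, 22, 8, 30, 22)},
]
last_extract = datetime(2016, 1, 1)

changed = delta(customers, last_extract)
print([r["name"] for r in changed])

# Customer 1 is deleted in the source: the row and its timestamp are gone,
# so the delta looks exactly the same as before.
customers_after_delete = [r for r in customers if r["id"] != 1]
changed_after_delete = delta(customers_after_delete, last_extract)
print([r["name"] for r in changed_after_delete])
```

Soft deletes (a deleted flag plus a new change timestamp) are a common workaround, but they require a change in the source application.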

Page 75: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Daimler TSS GmbH

Snapshot-based

Data Warehouse / DHBW / Fall 2016 / Page 75

• Data comparison

• Comparison of snapshots of the operational data at different points in time

• Compute difference between two latest snapshots

• E.g. unload all data from a table into a file and diff newest file content with latest

file content

• Can be very complex

• Sometimes the only possibility, for instance for legacy applications

• High performance impact on source
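A snapshot comparison can be sketched as a set difference over primary keys. The example below is a minimal illustration, assuming each snapshot fits in memory as a `{primary_key: row}` dictionary; `diff_snapshots` and the sample rows are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots given as {primary_key: row} dictionaries."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserted, updated, deleted

previous = {1: ("Miller", "DWH"), 2: ("Powell", "DB")}
current = {1: ("Miller", "BI"), 3: ("Jones", "DB")}

inserted, updated, deleted = diff_snapshots(previous, current)
```

Deletes are detected here, but a row that changed twice between the two snapshots shows up only with its final state.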

Page 76: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Monitoring techniques comparison

Criterion | Trigger-based | Replication techniques | Log-based discovery | Timestamp-based discovery | Snapshot-based discovery
Performance impact on source system | Medium | Low | Low | Medium | High
Performance impact on target system | Low | Low | Low | Low | High
Load on network | Low | Low | Low | Low | High
Data loss with nologging operations | No | Yes | Yes | No | No

Page 77: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Monitoring techniques comparison

Criterion | Trigger-based | Replication techniques | Log-based discovery | Timestamp-based discovery | Snapshot-based discovery
Identify DELETE operations | Yes | Yes | Yes | No | Yes
Identify ALL changes (changes between extractions) | Yes | Yes | Yes | No | No
Near real-time ready | Maybe | Yes | Yes | Unlikely | Unlikely

Page 78: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Data Transport

• Direct Access

• Source writes data into target or

• Target reads data from source

• Security concerns

• High coupling / dependencies

• File transfer (or other transport medium)

• csv, json, xml, binary, etc

• Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise service bus), SOA (service oriented architecture), etc.

• Data volumes are often large, so bulk transfer of compressed data is most widely used

• Better decoupling of source and target


Page 79: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Extraction intervals

• Extraction intervals

• Periodically – in regular intervals

• Every day, week, etc.

• Instantly / Continuous

• Every change is directly propagated into the data warehouse

• „real time data warehouse“

• Depends on the requirements on timeliness of the data warehouse data

• Triggered by a specific request

• Addition of a new product

• Query which involves more recent data

• Triggered by specific events

• Number of changes in operational data exceeds threshold

Page 80: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Prerequisite of ETL - Understanding The Data

• Profile Existing Data Sources, Extracted Data

• Analyze data structure, content, and quality

• Find data relationships across systems

• Often badly documented or missing foreign keys

• Uncover data issues that can affect subsequent transformation steps

• Missing values

• Duplicates

• Inconsistencies
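A very small profiling pass over extracted rows might look like the following. It is a sketch only, assuming rows arrive as dictionaries; the `profile` function and the sample rows are hypothetical, and real profiling tools cover far more (value distributions, patterns, cross-system relationships).

```python
from collections import Counter

def profile(rows, key):
    """Count missing values per column and list key values that occur more than once."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
    key_counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    return dict(missing), duplicates

# Hypothetical extract with one missing department and one duplicated key
rows = [
    {"customer_id": 1, "name": "Miller", "department": "DWH"},
    {"customer_id": 1, "name": "Miller", "department": None},
    {"customer_id": 2, "name": "Powell", "department": "DB"},
]

missing, duplicates = profile(rows, key="customer_id")
```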

Page 81: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Exercise

• For one of the following companies

• Bank

• Telecommunication company

• Online book store (like Amazon.com)

• Supermarket

describe 5 potential data quality problems.

• What could be done to prevent these problems?

• Which impact might these problems have on its business?

Page 82: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Data Quality issues

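CustomerNo | Name        | Birthdate  | Age | Gender | Zip code
1          | Miller, Tom | 33.01.2001 | 15  | M      | NULL
1          | John Mayor  | 15.01.2001 | 15  | M      | 98144
2          | Mrs. Bush   | 31.10.1988 | 22  | Q      | 00000

Issues illustrated, by column:

• CustomerNo: PK / unique key violated

• Name: data not uniform

• Birthdate: not valid

• Age: inconsistent

• Gender: wrong value

• Zip code: unknown / missing; FK violated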

Page 83: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Data Quality issues: solutions (prevention) in OLTP

Issue | Solution
Wrong data, e.g. 31.02.2016 | Proper data type definition
Wrong values, e.g. number out of range | CHECK constraint
Missing values | NOT NULL constraint
Violated references | FOREIGN KEY constraint
Duplicates | PRIMARY KEY or UNIQUE constraint
Inconsistent data | ACID transactions, business logic, additional checks
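The constraint-based prevention listed above can be tried out with an in-memory SQLite database. This is a sketch under assumptions: the table and column names are hypothetical, and SQLite does not enforce strict date types, so the "proper data type definition" row is not demonstrated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE zip_code (zip TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE customer (
        customer_no INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        gender      TEXT CHECK (gender IN ('F', 'M')),
        zip         TEXT REFERENCES zip_code (zip)
    )
    """
)
conn.execute("INSERT INTO zip_code VALUES ('98144')")
conn.execute("INSERT INTO customer VALUES (1, 'John Mayor', 'M', '98144')")

def rejected(sql):
    """Return True if the database refuses the statement with a constraint error."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

results = {
    "duplicate key": rejected("INSERT INTO customer VALUES (1, 'Tom Miller', 'M', '98144')"),
    "missing value": rejected("INSERT INTO customer VALUES (2, NULL, 'M', '98144')"),
    "wrong value": rejected("INSERT INTO customer VALUES (3, 'Mrs. Bush', 'Q', '98144')"),
    "violated reference": rejected("INSERT INTO customer VALUES (4, 'Mrs. Bush', 'F', '00000')"),
}
row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Each of the four bad rows is refused by the database itself, so only the one valid customer remains in the table.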

Page 84: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Data Quality issues: workarounds in DWH

• Correcting the data

• Automatically during ETL

• E.g., address of a customer if a correct reference table exists

• Manually after ETL is finished

• ETL stores bad data in error log tables or files

• ETL flags bad data (e.g. invalid)

• At the source systems

• Common master data management across all operational applications

• Dedicated systems are “master” of e.g. customer data

• Correcting the data at the source is the best approach, but it is slow and often not feasible

Page 85: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Data Quality issues: missing data

• Column is null

• Reject data

• Use default values

• Missing values can represent

• an unknown value

• like the date of birth of a customer

• a value that does not exist

• like "engine type" for a bicycle in a vehicles table

• Dimension tables can include some dummy values:

DimensionTable_X Description

-1 Unknown

-2 Missing

Page 86: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Data Quality issues: dirty data

• Data is inaccurate, e.g. wrong date 32.12.2015 or wrong number 55U

• Reject data

• Replace with value that represents „Invalid“

• Dimension tables can include some dummy values:

DimensionTable_X Description

-1 Unknown

-2 Missing

-3 Invalid
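How an ETL step might route bad values to these dummy rows can be sketched as follows. The mapping is an assumption for illustration: an empty source field becomes -2 (Missing) and an unparseable date becomes -3 (Invalid). When to use -1 (Unknown) depends on business rules that the sketch does not decide. The `date_key` function and the YYYYMMDD key format are hypothetical.

```python
from datetime import datetime

# Surrogate keys of the dummy dimension rows
UNKNOWN, MISSING, INVALID = -1, -2, -3

def date_key(raw):
    """Map a raw DD.MM.YYYY string to a YYYYMMDD integer key or to a dummy key."""
    if raw is None or raw.strip() == "":
        return MISSING
    try:
        parsed = datetime.strptime(raw.strip(), "%d.%m.%Y")
    except ValueError:
        # e.g. 32.12.2015 is not a real calendar date
        return INVALID
    return int(parsed.strftime("%Y%m%d"))

keys = [date_key(value) for value in ["15.01.2001", "32.12.2015", None, ""]]
```

Facts then always reference an existing dimension row, so joins stay inner joins and reports can show the bad data explicitly instead of dropping it.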

Page 87: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Data Quality issues: conflicting data

• Data has conflicts, e.g. postal code 80995 combined with the city Stuttgart

• Reject data

• Replace one of the values with a value that represents „Invalid“

• Which value to replace? Rules necessary

Page 88: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Data Quality issues: inconsistent data

• Data is inconsistent, e.g. order date after payment date or an implausibly high price for a product

• Can be discovered by statistical and data mining methods

Page 89: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Data Quality issues: duplicates

• Data is duplicated, e.g. “Martin Miller” vs “Miller, Martin” vs “M.Miller”

• Multiple representations for one entity

• Keys

• Encodings

• Duplicate detection can be very difficult / tricky

• Standardize / harmonize data during the ETL flow: “unification” for better duplicate detection
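A crude unification step for names shows both the benefit and the limits. The sketch below assumes a simple comparison key built from lowercased, sorted name tokens; `name_key` is a hypothetical helper, not a complete matching method.

```python
def name_key(raw):
    """Build a crude comparison key: lowercase name tokens in sorted order."""
    cleaned = raw.replace(",", " ").replace(".", " ")
    tokens = [token.lower() for token in cleaned.split()]
    return tuple(sorted(tokens))

key_a = name_key("Martin Miller")
key_b = name_key("Miller, Martin")
key_c = name_key("M.Miller")
```

The first two spellings collapse to the same key, while the abbreviated form does not. Catching that case needs further rules or fuzzy matching, which is why duplicate detection gets tricky.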

Page 90: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Transform - Unification of data

• Unification of data types

• Character string → date: “20.01.2006” → 20.01.2006

• Character string → number: “12345” → 12345

• Unification of encodings

• For instance for gender F and M

• Lookup-tables contain the mapping from old to new encodings

• Unification of names:

• “last name”, “first name” like “Maier, Peter” → separate into “Peter” and “Maier”

• Can become very challenging: “Herr Prof. Dr. Hans M. vom und zum Stein” or “Werner Martin” or “Mariae Gloria … Wilhelmine Huberta Gräfin von Schönburg-Glauchau”
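The simple cases of these unifications can be sketched as small conversion functions. Everything here is illustrative: the lookup table contents and the function names are assumptions, and `split_name` handles only the plain “last name, first name” form, not the challenging names mentioned above.

```python
from datetime import date, datetime

# Hypothetical lookup table from source-specific codes to the unified F/M encoding
GENDER_LOOKUP = {"f": "F", "female": "F", "w": "F", "m": "M", "male": "M"}

def to_date(text):
    """Convert a DD.MM.YYYY character string into a date value."""
    return datetime.strptime(text, "%d.%m.%Y").date()

def to_number(text):
    """Convert a character string into an integer."""
    return int(text)

def unify_gender(code):
    """Translate an old encoding into the unified one via the lookup table."""
    return GENDER_LOOKUP[code.strip().lower()]

def split_name(text):
    """Split 'last name, first name' into (first name, last name)."""
    last, first = (part.strip() for part in text.split(",", 1))
    return first, last

converted = (
    to_date("20.01.2006"),
    to_number("12345"),
    unify_gender("w"),
    split_name("Maier, Peter"),
)
```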

Page 91: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Transform - Unification of data

• Unification of dates and timestamps

• Rules for representing incomplete date information

• If only month and year are known

• Dates and timestamps with regard to one specific timezone

• Important for multi-national organizations

• UTC (Coordinated Universal Time) has no daylight saving time

• Combination of several attributes into one attribute

• day, month, year → date

• Split of one attribute into two or more

• Name → first name, last name (“Herr Prof. Dr. Hans M. vom und zum Stein”)

• Product name “Cola, 0.33 l” → product short name “Cola”, size in liters 0.33
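These transformations can be sketched with the standard library. The function names are hypothetical, the source timestamp is assumed to carry a fixed UTC+1 offset, and the product pattern is assumed to be “name, size l”; other formats would need their own rules.

```python
import re
from datetime import date, datetime, timedelta, timezone

def to_utc(local_dt):
    """Convert a timezone-aware timestamp to UTC."""
    return local_dt.astimezone(timezone.utc)

def combine_date(day, month, year):
    """Combine three separate attributes into one date attribute."""
    return date(year, month, day)

def split_product(name):
    """Split 'Cola, 0.33 l' into a short name and a size in liters."""
    match = re.fullmatch(r"(.+),\s*(\d+(?:\.\d+)?)\s*l", name)
    if match is None:
        raise ValueError(f"unexpected product name format: {name!r}")
    return match.group(1).strip(), float(match.group(2))

cet = timezone(timedelta(hours=1))  # assumed fixed offset of the source system
utc_ts = to_utc(datetime(2016, 11, 3, 9, 0, tzinfo=cet))
combined = combine_date(3, 11, 2016)
short_name, liters = split_product("Cola, 0.33 l")
```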

Page 92: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Transform - Unification of data

• Computation of derived values

• Profit = sales price – purchase price

• Without clear definition, different interpretations possible

• Net or gross sales price?

• Net or gross purchase price?

• Aggregations

• Revenue of the year computed from revenues of the day

• Without clear definition, different interpretations possible

• Calendar year?

• Fiscal year?
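The effect of an unclear definition is easy to show with the yearly aggregation. The sketch assumes a fiscal year that starts in October and is labelled by the calendar year in which it ends; both the convention and the sample revenues are hypothetical.

```python
from collections import defaultdict
from datetime import date

def fiscal_year(day, start_month=10):
    """Fiscal year labelled by the calendar year in which it ends (assumed convention)."""
    if start_month == 1:
        return day.year  # fiscal year equals calendar year
    return day.year + 1 if day.month >= start_month else day.year

def revenue_per_year(daily, year_of):
    """Aggregate (day, revenue) pairs into yearly totals using the given year definition."""
    totals = defaultdict(float)
    for day, revenue in daily:
        totals[year_of(day)] += revenue
    return dict(totals)

daily = [
    (date(2016, 9, 30), 100.0),
    (date(2016, 10, 1), 50.0),
    (date(2016, 12, 31), 25.0),
]

by_calendar = revenue_per_year(daily, lambda d: d.year)
by_fiscal = revenue_per_year(daily, fiscal_year)
```

The same daily revenues produce different yearly figures, which is why the definition has to be agreed on before the aggregation is built.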

Page 93: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Data Mapping

• Specification between source and target columns

• Source tables + columns

• Target table + columns

• Join rules

• Filter criteria

• Transformation rules

Page 94: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Load

• Efficient load operations are important

• bulk load or bulk processing in general

• Single row processing vs set based processing

• Online load

• Data warehouse (especially Data Mart) is still accessible

• For incremental updates

• Offline load

• Data warehouse (especially Data Mart) is offline

• For updates that require the recomputation of a cube

• Offline load is often a tool limitation because the tool locks data structures

• Because of such locks, the offline load is normally faster. Therefore the offline load is often run instead of an online load if there were many data changes

Page 95: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Bulk processing

• Specific bulk load operations provided by the RDBMS, e.g. external tables in Oracle or the LOAD command in DB2

• Single row vs set based processing

Single row processing:

    Cursor curs = SELECT * FROM <source>
    WHILE NOT EOF(curs)
        FETCH NEXT ROW INTO myRow;
        INSERT INTO <target> VALUES(myRow);
    LOOP

• Error handling easy

• Slow for high amounts of data

• More coding

Set based processing:

    INSERT INTO <target>
    SELECT * FROM <source>

• All or nothing if there are errors

• Performs well for small and high amounts of data

• Less code = fewer errors
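Both styles can be run side by side against an in-memory SQLite database. This is a sketch with hypothetical table names (`source`, `target_row`, `target_set`); it shows the shape of the two approaches, not a benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE target_row (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE target_set (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [(i, f"customer {i}") for i in range(1, 1001)],
)

# Single row processing: fetch every row and insert it individually.
# Errors could be handled per row inside the loop.
for row in conn.execute("SELECT id, name FROM source").fetchall():
    conn.execute("INSERT INTO target_row VALUES (?, ?)", row)

# Set based processing: one statement moves all rows; it succeeds or fails as a whole.
conn.execute("INSERT INTO target_set SELECT id, name FROM source")
conn.commit()

count_row = conn.execute("SELECT COUNT(*) FROM target_row").fetchone()[0]
count_set = conn.execute("SELECT COUNT(*) FROM target_set").fetchone()[0]
```

Both targets end up with the same rows; the set based variant needs a single statement and leaves the work to the database engine.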

Page 96: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

ETL-Job parallelism for loading data into Core Warehouse Layer

[Figure: job schedules within the time windows for loads (e.g. 00:00-06:00), comparing a Classical Load with a Data Vault Load. The Data Vault Load shows the stages "HUB loaded", "LINK and HUB-SAT loaded", and "LINK-SAT loaded". The label "Integration of new jobs" and question marks also appear in the figure.]

Classical Load

• Complex

• Many dependencies

• Many sequential jobs

Data Vault Load

• Systematic / Methodic

• Few, well defined dependencies

• Massive parallel

Page 97: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Standard Data Warehouse Architecture

[Figure: architecture diagram. Components shown: internal and external data sources; the Data Warehouse with Backend and Frontend; Staging Layer (Input Layer), Core Warehouse Layer (Storage Layer), and Reporting Layer (Output Layer, Mart Layer); three question marks appear below the layers]

Page 98: Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW)

Thank you!
