TRANSCRIPT
A company of Daimler AG
LECTURE @DHBW: DATA WAREHOUSE
PART XX: DATA ENGINEERING
ANDREAS BUCKENHOFER, DAIMLER TSS
ABOUT ME
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2
Andreas Buckenhofer
Senior DB, [email protected]
Since 2009 at Daimler TSS / Department: Big Data / Business Unit: Analytics
ANDREAS BUCKENHOFER, DAIMLER TSS GMBH
Data Warehouse / DHBW | Daimler TSS | 3
“Forming good abstractions and avoiding complexity is an essential part of a successful data architecture”
Data has always been my main focus during my long-time occupation in the area of data integration. I work for Daimler TSS as a Database Professional and Data Architect with over 20 years of experience in Data Warehouse projects. I have been working with Hadoop and NoSQL since 2013. I keep my knowledge up to date, and I learn new things, experiment, and program every day.
I share my knowledge in internal presentations or as a speaker at international conferences. I'm regularly giving a full lecture on Data Warehousing and a seminar on modern data architectures at Baden-Wuerttemberg Cooperative State University DHBW. I also gained international experience through a two-year project in Greater London and several business trips to Asia.
I'm responsible for In-Memory DB Computing at the independent German Oracle User Group (DOAG) and was honored by Oracle as an ACE Associate. I hold current certifications such as "Certified Data Vault 2.0 Practitioner (CDVP2)", "Big Data Architect", "Oracle Database 12c Administrator Certified Professional", "IBM InfoSphere Change Data Capture Technical Professional", etc.
Contact/Connect
As a 100% Daimler subsidiary, we give 100 percent, always and never less. We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future. Our objective: we make Daimler the most innovative and digital mobility company.
NOT JUST AVERAGE: OUTSTANDING.
Daimler TSS
INTERNAL IT PARTNER FOR DAIMLER
+ Holistic solutions according to the Daimler guidelines
+ IT strategy
+ Security
+ Architecture
+ Developing and securing know-how
+ TSS is a partner who can be trusted with sensitive data
As a subsidiary: maximum added value for Daimler
+ Market closeness
+ Independence
+ Flexibility (short decision-making processes, ability to react quickly)
Daimler TSS
LOCATIONS
Daimler TSS China
Hub Beijing
10 employees
Daimler TSS Malaysia
Hub Kuala Lumpur
42 employees
Daimler TSS India
Hub Bangalore
22 employees
Daimler TSS Germany
7 locations
1000 employees*
Ulm (Headquarters)
Stuttgart
Berlin
Karlsruhe
* as of August 2017
After the end of this lecture you will be able to:
• understand the concepts behind ETL
WHAT YOU WILL LEARN TODAY
DATA SCIENTIST: SEXIEST JOB OF THE 21ST CENTURY?
Source: https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientist-article
DATA SCIENTIST VS DATA ENGINEERING VS SOFTWARE ENGINEERING
Source: 2018 Enterprise Almanac
THE MACHINE LEARNING PIPELINE, AKA "DATA ENGINEERING", AKA "DATA INTEGRATION"
LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
Data Warehouse (Backend → Frontend)
Internal data sources (OLTP systems) and external data sources feed the warehouse layers:
• Staging Layer (Input Layer)
• Integration Layer (Cleansing Layer)
• Core Warehouse Layer (Storage Layer)
• Aggregation Layer
• Mart Layer (Output Layer, Reporting Layer)
Cross-cutting components: Metadata Management, Security, DWH Manager incl. Monitor
Extract – Transform – Load
Other term: data integration (a better, more neutral term)
ETL PROCESS
• Capture and copy data from source systems (e.g. operational systems)
• Many different types of sources:
  • Relational and hierarchical DBMSs
  • Flat files
  • Other internal/external sources
TASKS OF THE ETL PROCESS - EXTRACT
• Filter data
• Integrate data
• Check and cleanse data
TASKS OF THE ETL PROCESS - TRANSFORM
• Original meaning: Fast load into staging area
• General meaning: Loading data into staging area or another layer
TASKS OF THE ETL PROCESS - LOAD
ETL is often used for data integration in general (covering both ETL and ELT)
But: when ELT is mentioned explicitly, it is being differentiated from ETL
ETL VS ELT
Data flow:
• ETL: Source DB → ETL Server (performs the transformations) → Target DB
• ELT: Source DB → Target DB; the ELT server generates and coordinates the transformations, which run inside the DB
ETL VS ELT
ETL:
• Data is transferred to the ETL server and transferred back to the DB; high network bandwidth required
• Transformations are performed in the ETL server
• Proprietary code is executed in the ETL server
• Typically used for source-to-target transfer, compute-intensive transformations, small amounts of data

ELT:
• Data remains in the DB except for cross-database loads (e.g. source to target)
• Transformations are performed (in the source or) in the target
• Generated code, e.g. SQL, PL/SQL, SQLT
• Typically used for high amounts of data
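The ELT column above can be sketched in a few lines: the raw data is loaded first, and the transformation is SQL executed by the database engine itself, not by a separate server. This is a minimal illustration using Python's sqlite3 with made-up table names.

```python
import sqlite3

# Minimal ELT sketch (hypothetical tables): load raw first,
# then transform with SQL running inside the target database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_sales (id INTEGER, amount TEXT)")
con.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

# L: load the raw, untransformed rows into the staging table
con.executemany("INSERT INTO stg_sales VALUES (?, ?)",
                [(1, "10.5"), (2, "20.0")])

# T: the transformation is generated SQL executed by the DB engine
con.execute("INSERT INTO sales SELECT id, CAST(amount AS REAL) FROM stg_sales")

total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 30.5
```

In an ETL setup, the CAST step would instead run in the ETL server before the data reaches the target.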
ETL/ELT TOOL VS MANUAL ETL/ELT
ETL tool (Informatica, Talend, Oracle ODI, etc.):
• Separate license
• Workflow, error handling, and restart/recovery functionality included
• Impact analysis and where-used (lineage) functionality available
• Faster development, easier maintenance
• Additional (tool) know-how required

Manual ETL (SQL, PL/SQL, SQLT, etc.):
• No additional license
• Workflow, error handling, and restart/recovery functionality must be implemented manually
• Impact analysis and where-used (lineage) functionality difficult
• Slower development, more difficult maintenance
• Know-how often available
ETL/ELT TOOL VS MANUAL ETL/ELT
Typical services of an ETL tool:
• Extract services: connectors, sorter
• Load services: connector, sorter, bulk loader
• Data profiling services: source analysis
• Data quality services: data cleansing
• Data transformation and integration services: data mapping, business rules, slowly changing dimensions, datatype conversion, lookups
• Operations management services: scheduler, control, repository management, job monitoring, auditing, error handling, security
MAPPING - INFORMATICA
[Screenshot: Informatica mapping from Source to Target with Filter and Lookup transformations]
MAPPING WITH TRANSFORMATIONS - INFORMATICA
[Screenshot: Informatica mapping with Sorter, Aggregator, and Union transformations]
• Specification between source and target columns
• Source tables + columns
• Target table + columns
• Join rules
• Filter criteria
• Transformation rules
DATA MAPPING
WORKFLOW - INFORMATICA
[Screenshot: Informatica workflow with a decision & coordination step and a session containing the mapping]
JOB MONITORING - INFORMATICA
Extracts from source systems
• Initial extract for setting up the data warehouse: initial load
• Periodical extracts for adding new/changed information to the data warehouse: incremental load
Question: how can we determine what is new or what has changed in the source systems?
→ This is the task of "monitoring"
MONITORING (DATA CHANGE DETECTION)
Discovery of all changes vs. determining only the net effect at extract/load time
• Example: an attribute value can be changed in two ways:
  • by one update operation
  • by one delete and one insert operation
• The net effect of both is the same
• However, history information is lost if only the net effect is recorded
MONITORING: NET EFFECT OF CHANGES
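The net-effect argument above can be made concrete with a single hypothetical row: two different operation sequences end in the same state, so a comparison at extract time cannot tell them apart.

```python
# Sketch of the net-effect idea with one hypothetical row.
row = {"id": 1, "city": "Ulm"}

# Way 1: one UPDATE operation
updated = dict(row, city="Berlin")

# Way 2: one DELETE followed by one INSERT
deleted_then_inserted = None                         # delete
deleted_then_inserted = {"id": 1, "city": "Berlin"}  # insert

# A snapshot/extract-time comparison only sees the identical net effect;
# HOW the row changed (update vs. delete+insert) is lost.
assert updated == deleted_then_inserted
```

A log reader, in contrast, would have recorded both intermediate operations.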
Which techniques can be used to identify changes in a source system (RDBMS)?
• E.g. in an OLTP system:
  • new products are inserted
  • customer addresses change
  • a product is deleted because it is out of stock
How would you identify such changes? List advantages / disadvantages of possible solutions.
Think about making changes in the source system, but also about solutions without any change in the source system.
EXERCISE MONITORING
These depend on the characteristics of the data sources. The following techniques are based on modern relational DBMSs.
Types of techniques:
• Based on the DBMS:
  • Trigger-based
  • Log-based discovery
  • Replication techniques
• Controlled by the application:
  • Timestamp-based discovery
  • Snapshot-based discovery
MONITORING TECHNIQUES
Active monitoring mechanisms based on (database) triggers
• Example: if a new record is inserted into the sales transaction table, then insert the transaction id and a timestamp into a change table
Advantage:
• Triggers do not require changes to the operational applications
Disadvantages:
• Performance impact on operational systems if triggers are used extensively
• Triggers have to be implemented for every table in the source systems
TRIGGER-BASED
Sample trigger code, Oracle:
CREATE [OR REPLACE] TRIGGER <trigger_name>
{BEFORE|AFTER} {INSERT|DELETE|UPDATE}
ON <table_name>
[REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]
[FOR EACH ROW [WHEN (<trigger_condition>)]]
<trigger_body>
A trigger is created for each source table in the OLTP DB and stores insert/update/delete changes in a "log/journal table"
• the trigger body contains insert statements into the log/journal table
TRIGGER-BASED
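The Oracle syntax skeleton above can be tried out with SQLite, which supports a very similar trigger mechanism. This is an illustrative sketch with made-up table names: an AFTER INSERT trigger fills the log/journal table, exactly as described on the slide.

```python
import sqlite3

# Trigger-based change capture sketch (hypothetical tables, SQLite syntax).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (tx_id INTEGER PRIMARY KEY, amount REAL)")
con.execute("CREATE TABLE sales_journal (tx_id INTEGER, op TEXT, ts TEXT)")

# The trigger body inserts into the log/journal table
con.execute("""
CREATE TRIGGER trg_sales_ins AFTER INSERT ON sales
BEGIN
  INSERT INTO sales_journal VALUES (NEW.tx_id, 'I', datetime('now'));
END""")

con.execute("INSERT INTO sales VALUES (1, 99.0)")
journal = con.execute("SELECT tx_id, op FROM sales_journal").fetchall()
print(journal)  # [(1, 'I')]
```

Corresponding UPDATE and DELETE triggers would be needed per table, which illustrates the implementation-effort disadvantage from the previous slide.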
Log-based discovery
Also often referenced as CDC (Change Data Capture)
Usage of database transaction logs to determine changes
• DBMSs write transaction logs in order to be able to undo partially executed transactions
• This information can be used to determine all changes
• Log reader identifies insert, update, delete, truncates and writes the changes as inserts into staging layer
Transaction Log files can be transferred to other systems to avoid additional load on source systems
LOG-BASED
LOG-BASED (SAMPLE PRODUCT ARCHITECTURE IIDR)
[Architecture diagram: the source OLTP DB writes transaction logs; an IIDR replication engine reads them on the source datastore and ships changes to an IIDR replication engine on the DWH datastore; the DWH DB contains staging, core, and mart layers feeding standard and ad-hoc reports in the frontend]
Replication techniques
Data replication
• Target tables are not necessarily on the local system
• Typically uses transaction logs
• A log reader identifies inserts, updates, deletes, truncates and writes the changes into replicated tables (an insert remains an insert, an update remains an update, etc.)
• Useful for 1:1 copies (e.g. ODS, Operational Data Store), but it is still a challenge to detect changes for loading the data mart
REPLICATION-BASED
REPLICATION-BASED (SAMPLE PRODUCT ARCHITECTURE IIDR)
[Architecture diagram: the source OLTP DB writes transaction logs; an IIDR replication engine reads them on the source datastore and ships changes to an IIDR replication engine on the DWH datastore; the DWH DB contains staging, core, and mart layers feeding standard and ad-hoc reports in the frontend]
Timestamp-based discovery
• Every data item in a table is associated with timestamp information about its validity period
• Changed data can be determined from this timestamp information
TIMESTAMP-BASED
Sample customer table in OLTP
• Each table gets a change timestamp column
• The delta process reads the latest data only (e.g. ChangeTimestamp >= <yesterday>)
• Problem: it is not possible to identify deleted rows
TIMESTAMP-BASED
CustomerID | Name | Department | Change Timestamp
1 | Miller | DWH | 15.01.2015 17:00:01
2 | Powell | DB | 22.03.2016 08:30:22
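The delta read and its blind spot for deletes can be shown in a few lines; this sketch uses sqlite3 with the sample customers above (ISO timestamps assumed so string comparison works).

```python
import sqlite3

# Timestamp-based delta extract sketch (hypothetical table/column names).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, name TEXT, change_ts TEXT)")
con.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, "Miller", "2015-01-15 17:00:01"),
    (2, "Powell", "2016-03-22 08:30:22"),
])

# Delta process: read only rows changed since the last load
last_load = "2016-01-01 00:00:00"
delta = con.execute(
    "SELECT id, name FROM customer WHERE change_ts >= ?", (last_load,)
).fetchall()
print(delta)  # [(2, 'Powell')]

# The problem from the slide: a deleted row leaves no row behind with a
# newer timestamp, so the next delta extract simply cannot see the delete.
con.execute("DELETE FROM customer WHERE id = 1")
delta2 = con.execute(
    "SELECT id, name FROM customer WHERE change_ts >= ?", (last_load,)
).fetchall()
print(delta2)  # still [(2, 'Powell')]
```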
Data comparison
Comparison of snapshots of the operational data at different points in time
• Compute the difference between the two latest snapshots
• E.g. unload all data from a table into a file and diff the newest file content against the previous file content
Can be very complex
Sometimes the only possibility, for instance for legacy applications
High performance impact on the source
SNAPSHOT-BASED
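The snapshot comparison can be sketched as a diff of two full extracts keyed by primary key; the rows here are invented, and in practice each dictionary would come from an unloaded file.

```python
# Snapshot-based discovery sketch: diff two extracts keyed by PK.
old = {1: ("Miller", "Ulm"), 2: ("Powell", "Berlin"), 3: ("Bush", "Kiel")}
new = {1: ("Miller", "Stuttgart"), 2: ("Powell", "Berlin"), 4: ("Martin", "Ulm")}

inserts = {k: new[k] for k in new.keys() - old.keys()}
deletes = {k: old[k] for k in old.keys() - new.keys()}
updates = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}

print(inserts)  # {4: ('Martin', 'Ulm')}
print(deletes)  # {3: ('Bush', 'Kiel')}
print(updates)  # {1: ('Miller', 'Stuttgart')}
```

Unlike the timestamp technique, this does see deletes, but only at the cost of reading and comparing everything, which is where the high performance impact comes from.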
MONITORING TECHNIQUES COMPARISON
(Trigger-based / Replication techniques / Log-based discovery / Timestamp-based discovery / Snapshot-based discovery)
• Performance impact on source system: Medium / Low / Low / Medium / High
• Performance impact on target system: Low / Low / Low / Low / High
• Load on network: Low / Low / Low / Low / High
• Data loss with nologging operations: No / Yes / Yes / No / No
MONITORING TECHNIQUES COMPARISON
(Trigger-based / Replication techniques / Log-based discovery / Timestamp-based discovery / Snapshot-based discovery)
• Identify DELETE operations: Yes / Yes / Yes / No / Yes
• Identify ALL changes (changes between extractions): Yes / Yes / Yes / No / No
Direct Access
• Source writes data into target or
• Target reads data from source
• Security concerns
• High coupling / dependencies
DATA TRANSPORT – DIRECT ACCESS
Source → Target
File transfer (or other transport medium)
• csv, json, xml, binary, etc
• Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise service bus), SOA (service oriented architecture), etc
• Often high amounts of data, therefore bulk transfer of compressed data most widely used
• Better decoupling of source and target
DATA TRANSPORT – FILE TRANSFER
Source → files → Target
Extraction intervals
• Periodically, in regular intervals
  • every day, week, etc.
• Instantly / continuously
  • every change is directly propagated into the data warehouse
  • "real-time data warehouse"
• Depends on the requirements on the timeliness of the data warehouse data
EXTRACTION INTERVALS
Triggered by a specific request
• Addition of a new product
• Query which involves more recent data
Triggered by specific events
• Number of changes in operational data exceeds threshold
EXTRACTION INTERVALS
DATA QUALITY
Source: https://twitter.com/markmadsen/status/1059579065164738560?s=21
• Profile existing data sources and extracted data
• Analyze data structure, content, and quality
• Find data relationships across systems
  • often badly documented or missing foreign keys
• Uncover data issues that can affect subsequent transformation steps
  • missing values
  • duplicates
  • inconsistencies
PREREQUISITE OF TRANSFORMATION: UNDERSTANDING THE DATA
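Two of the profiling checks above (missing values, duplicate keys) can be sketched in plain Python over a hypothetical extract; a real profiling tool would run many such checks per column.

```python
from collections import Counter

# Minimal profiling sketch over a hypothetical extracted table.
rows = [
    {"id": 1, "zip": None},
    {"id": 1, "zip": "98144"},
    {"id": 2, "zip": "00000"},
]

# Missing values per column
missing_zip = sum(1 for r in rows if r["zip"] is None)

# Duplicate candidate keys
id_counts = Counter(r["id"] for r in rows)
duplicate_ids = [k for k, n in id_counts.items() if n > 1]

print(missing_zip)    # 1
print(duplicate_ids)  # [1]
```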
DATA PYRAMID AND DATA QUALITY
Source: By Matthew.viel - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=49310779 LinkedIn 11/2017: https://www.linkedin.com/feed/update/urn:li:activity:6334062387355746304
DATA QUALITY ISSUES
CustomerNo | Name | Birthdate | Age | Gender | Zip code
1 | Miller, Tom | 33.01.2001 | 15 | M | NULL
1 | John Mayor | 15.01.2001 | 15 | M | 98144
2 | Mrs. Bush | 31.10.1988 | 22 | Q | 00000
3 | Martin | 31.10.1988 | 22 | M | 75890
Issues illustrated: PK / unique key violated (CustomerNo 1 appears twice), data not uniform (name formats), not valid (33.01.2001), inconsistent, wrong value, unknown / missing (NULL), FK violated
DATA QUALITY ISSUES AND POSSIBLE SOLUTIONS IN THE SOURCE RDBMS
Issue → Solution
• Wrong data, e.g. 31.02.2016 → proper data type definition
• Wrong values, e.g. number out of range → check constraint
• Missing values → NOT NULL constraint
• Violated references → FOREIGN KEY constraint
• Duplicates → PRIMARY or UNIQUE KEY constraint
• Inconsistent data → ACID transactions, business logic, additional checks
Correcting the data
• Automatically during ETL
  • e.g. the address of a customer, if a correct reference table exists
• Manually after ETL is finished
  • ETL stores bad data in error log tables or files
  • ETL flags bad data (e.g. as invalid)
DATA QUALITY ISSUES: WORKAROUNDS IN DWH
Correcting the data
• In the source systems
  • common master data management across all operational applications
  • dedicated systems are the "master" of e.g. customer data
• Correcting the data at the source is the best approach, but slow and often not feasible
DATA QUALITY ISSUES: CORRECT DATA IN THE SOURCE
• Column is null
  • Reject data
  • Use default values
• Missing values can represent
  • an unknown value, like the date of birth of a customer
  • a missing value, like engine_id for a car (logical not null constraint)
• Dimension tables can include some dummy values:
DATA QUALITY ISSUES: MISSING DATA
DimensionTable_X Description
-1 Unknown
-2 Missing
• Data is inaccurate, e.g. wrong date 32.12.2015 or wrong number 55U
  • Reject data
  • Replace with a value that represents "Invalid"
• Dimension tables can include some dummy values:
DATA QUALITY ISSUES: INVALID DATA
DimensionTable_X Description
-1 Unknown
-2 Missing
-3 Invalid
• Data has conflicts, e.g. wrong postal code 80995 Stuttgart
  • Reject data
  • Replace one of the values with a value that represents "Invalid" or with a corrected value. Which value to replace? Rules are necessary
DATA QUALITY ISSUES: CONFLICTING DATA
• Data is inconsistent, e.g. unlikely high price for a product
• Can be discovered by statistical and data mining methods
DATA QUALITY ISSUES: INCONSISTENT DATA
• Data is duplicated, e.g. "Martin Miller" vs "Miller, Martin" vs "M. Miller"
• Multiple representations for one entity
  • different keys
  • different encodings
• Duplicate detection can be very difficult / tricky
• Products are available, e.g. for addresses:
  • address duplicate detection
  • address validation (Kingstreet: does this address actually exist?)
  • address harmonization (Kingstr, Kingstreet, King Street, etc.)
• Standardize / harmonize data during the ETL flow: "unification"
DATA QUALITY ISSUES: DUPLICATES
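A tiny harmonization step before duplicate detection can be sketched as follows; the single rewrite rule here is an invented illustration, and real products apply large rule sets (abbreviations like "M. Miller" would still need extra logic).

```python
import re

# Naive name harmonization sketch: collapse different spellings of the
# same person to one key before duplicate detection (illustrative rule).
def normalize(name: str) -> str:
    name = name.strip()
    if "," in name:  # "Miller, Martin" -> "Martin Miller"
        last, first = [p.strip() for p in name.split(",", 1)]
        name = f"{first} {last}"
    return re.sub(r"\s+", " ", name).lower()

variants = ["Martin Miller", "Miller, Martin", "martin  miller"]
keys = {normalize(v) for v in variants}
print(keys)  # {'martin miller'} -- all three collapse to one entity
```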
• Unification of data types
  • character string → date: "20.01.2006" → 20.01.2006
  • character string → number: "12345" → 12345
• Unification of encodings
  • for instance for gender: F and M
  • lookup tables contain the mapping from old to new encodings
• Combination of different attributes into one attribute
  • day, month, year → date
TRANSFORM - UNIFICATION OF DATA
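The three unification steps above (type cast, encoding lookup, attribute merge) fit into one small sketch; the lookup table and record layout are invented for illustration.

```python
from datetime import date

# Unification sketch: type casts, encoding lookup table, attribute merge.
GENDER_LOOKUP = {"F": "FEMALE", "W": "FEMALE", "M": "MALE"}  # hypothetical codes

def unify(rec):
    return {
        # day, month, year -> one date attribute
        "birthdate": date(rec["year"], rec["month"], rec["day"]),
        # character string -> number
        "amount": int(rec["amount"]),
        # old encoding -> new encoding via lookup table
        "gender": GENDER_LOOKUP[rec["gender"]],
    }

out = unify({"day": 20, "month": 1, "year": 2006, "amount": "12345", "gender": "W"})
print(out["birthdate"], out["amount"], out["gender"])  # 2006-01-20 12345 FEMALE
```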
• Split of one attribute into two or more
  • Name → first name, last name ("Herr Prof. Dr. Hans M. vom und zum Stein")
  • Unification of names can become very challenging: "Herr Prof. Dr. Hans M. vom und zum Stein" or "Werner Martin" or "Mariae Gloria … Wilhelmine Huberta Gräfin von Schönburg-Glauchau"
  • Product name "Cola, 0.33 l" → product short name "Cola", size in liters 0.33
TRANSFORM - UNIFICATION OF DATA
• Unification of dates and timestamps
  • rules for representing incomplete date information, e.g. if only month and year are known
• Dates and timestamps with regard to one specific timezone
  • important for multi-national organizations
  • UTC: Coordinated Universal Time, without daylight saving
• What can happen at the switch to winter time if no UTC is used?
  • an update arrives at 02:15 in the staging layer (CDC / log-based monitor)
  • the clock is changed to winter time: -1h
  • an update of the same row arrives at 02:10 in the staging layer (CDC / log-based)
  • how can the batch load running the next night discover which update is the most recent one?
TRANSFORM - UNIFICATION OF DATA
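The clock-change scenario above can be replayed with fixed offsets (a sketch: CEST = UTC+2 before the switch, CET = UTC+1 after it, using the 2018 autumn switch as an assumed date): the local wall clock orders the two updates wrongly, UTC orders them correctly.

```python
from datetime import datetime, timezone, timedelta

CEST = timezone(timedelta(hours=2))  # summer time, before the switch
CET = timezone(timedelta(hours=1))   # winter time, after the switch

first = datetime(2018, 10, 28, 2, 15, tzinfo=CEST)   # arrives at 02:15
second = datetime(2018, 10, 28, 2, 10, tzinfo=CET)   # arrives "later" at 02:10

# By local wall clock alone, the later update looks older ...
assert second.replace(tzinfo=None) < first.replace(tzinfo=None)
# ... but in UTC the real order is preserved:
assert second.astimezone(timezone.utc) > first.astimezone(timezone.utc)
```

Storing UTC in the staging layer is what lets the batch load pick the truly most recent update.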
• Computation of derived values
  • Profit = sales price - purchase price. Without a clear definition, different interpretations are possible:
    • net or gross sales price?
    • net or gross purchase price?
• Aggregations
  • Revenue of the year computed from revenues of the day. Without a clear definition, different interpretations are possible:
    • calendar year?
    • fiscal year?
TRANSFORM - UNIFICATION OF DATA
• Efficient load operations are important
  • bulk load: single-row processing vs. set-based processing
• Online load
  • the data warehouse (especially the data mart) remains accessible
• Offline load
  • the data warehouse (especially the data mart) is offline
  • for updates that require the recomputation of a cube
  • offline load is often a tool limitation because the tool locks data structures; on the other hand, an offline load can be faster
LOAD
• Specific Bulk load operations provided by RDBMS, e.g. External tables in Oracle or LOAD command in DB2
• Single row vs set based processing
BULK PROCESSING
Single row processing:
    cursor curs = SELECT * FROM <source>;
    WHILE NOT EOF(curs) LOOP
      FETCH NEXT ROW INTO myRow;
      INSERT INTO <target> VALUES (myRow);
    END LOOP;
• Error handling easy
• Slow for high amounts of data
• More coding

Set based processing:
    INSERT INTO <target> SELECT * FROM <source>;
• All or nothing if there are errors
• Performs well for small and high amounts of data
• Less code = fewer errors
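Both pseudocode variants above can be run against SQLite; the table names are made up, and the set-based variant is a single statement executed entirely inside the engine.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source_t (id INTEGER)")
con.execute("CREATE TABLE target_t (id INTEGER)")
con.executemany("INSERT INTO source_t VALUES (?)", [(i,) for i in range(1000)])

# Single row processing: fetch a cursor and insert row by row
for (i,) in con.execute("SELECT id FROM source_t").fetchall():
    con.execute("INSERT INTO target_t VALUES (?)", (i,))

con.execute("DELETE FROM target_t")

# Set based processing: one statement, executed inside the DB engine
con.execute("INSERT INTO target_t SELECT id FROM source_t")

n = con.execute("SELECT COUNT(*) FROM target_t").fetchone()[0]
print(n)  # 1000
```

On real data volumes the per-row round trips of the first variant dominate, which is why bulk loads favor the set-based form.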
ETL-JOB PARALLELISM FOR LOADING DATA INTO CORE WAREHOUSE LAYER
[Diagram: Data Vault load in parallel waves (HUBs loaded, then LINKs and HUB-SATs loaded, then LINK-SATs loaded) vs. classical load with long chains of sequential jobs]
Integration of new jobs; time windows for loads, e.g. 00:00-06:00
Classical load:
• complex
• many dependencies
• many sequential jobs
Data Vault load:
• systematic / methodic
• few, well-defined dependencies
• massively parallel
EXAMPLE FOR DATA INTEGRATION IN DATA VAULT 2.0 ARCHITECTURE
Source: Hans Hultgren: Modeling the agile Data Warehouse with Data Vault, New Hamilton 2012, p. 224
[Diagram: ETL ("monitoring") extracts from the sources; ETL with hard rules only loads the Raw Data Vault; (E)T(L) with soft rules loads the Business Data Vault; ETL loads onward into the marts]
Draw a flow diagram showing how to load a HUB, LINK, and SAT table, and describe the SQL statements
EXERCISE: LOAD DATA VAULT TABLE
EXERCISE: LOAD HUB TABLE
Flow: source data exist → load distinct business keys → does the business key already exist in the HUB?
• yes → reject the data
• no → insert the row into the HUB (conflict if there is a PK hash key collision!) → data loaded into HUB
INSERT INTO core.hub_fahrzeug (vehicle_hk, fin, loaddate, recordsource)
SELECT DISTINCT f.fahrzeug_hashkey
     , f.fin_bk
     , f.loaddate
     , f.recordsource
FROM staging.fahrzeugdaten f
WHERE f.fin_bk NOT IN (SELECT fin FROM core.hub_fahrzeug)
  AND f.loaddate = <date to load>;
EXERCISE: LOAD HUB TABLE
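The HUB load pattern above can be replayed end to end with sqlite3; this sketch reuses the slide's (German) names but drops the schema prefixes and the loaddate filter for brevity.

```python
import sqlite3

# Runnable sketch of the HUB load: insert only distinct business keys
# that are not yet present in the hub (hypothetical, simplified tables).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging_fahrzeugdaten (fahrzeug_hashkey TEXT, fin_bk TEXT)")
con.execute("CREATE TABLE hub_fahrzeug (vehicle_hk TEXT PRIMARY KEY, fin TEXT)")

con.execute("INSERT INTO hub_fahrzeug VALUES ('hk1', 'FIN1')")  # already loaded
con.executemany("INSERT INTO staging_fahrzeugdaten VALUES (?, ?)",
                [("hk1", "FIN1"), ("hk2", "FIN2"), ("hk2", "FIN2")])

con.execute("""
INSERT INTO hub_fahrzeug (vehicle_hk, fin)
SELECT DISTINCT f.fahrzeug_hashkey, f.fin_bk
FROM staging_fahrzeugdaten f
WHERE f.fin_bk NOT IN (SELECT fin FROM hub_fahrzeug)""")

hub = con.execute("SELECT vehicle_hk, fin FROM hub_fahrzeug ORDER BY fin").fetchall()
print(hub)  # [('hk1', 'FIN1'), ('hk2', 'FIN2')]
```

The existing key is skipped and the duplicate staging row is collapsed by DISTINCT, so the hub stays insert-only and duplicate-free.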
EXERCISE: LOAD LINK TABLE
Flow: source data exist → load distinct business keys → does the hash key relationship already exist in the LINK?
• yes → reject the data
• no → insert the row into the LINK (conflict if there is a PK hash key collision!) → data loaded into LINK
INSERT INTO core.link_verbaut (verbaut_hk, motor_hk, vehicle_hk, loaddate, recordsource)
SELECT DISTINCT f.verbaut_hashkey
     , f.motor_hashkey
     , f.fahrzeug_hashkey
     , f.loaddate
     , f.recordsource
FROM staging.fahrzeugdaten f
WHERE (f.motor_hashkey, f.fahrzeug_hashkey) NOT IN
      (SELECT v.motor_hk, v.vehicle_hk FROM core.link_verbaut v)
  AND f.loaddate = <date to load>;
EXERCISE: LOAD LINK TABLE
EXERCISE: LOAD SAT TABLE
Flow: source data exist → load distinct source data; load the current/latest row from the SAT table → is the MD5 hash diff identical?
• yes → reject the data
• no → insert the row into the SAT → data loaded into SAT
INSERT INTO core.sat_fahrzeug_text (vehicle_hk, loaddate, recordsource, md5_hash, codeleiste, kommentar)
SELECT DISTINCT f.fahrzeug_hashkey
     , f.loaddate
     , f.recordsource
     , f.md5hash
     , f.codeleiste
     , f.kommentar
FROM staging.fahrzeugdaten f
LEFT OUTER JOIN (SELECT s.vehicle_hk, s.md5_hash
                 FROM core.sat_fahrzeug_text s
                 JOIN (SELECT i.vehicle_hk, MAX(i.loaddate) AS loaddate
                       FROM core.sat_fahrzeug_text i
                       GROUP BY i.vehicle_hk) m
                   ON s.vehicle_hk = m.vehicle_hk AND s.loaddate = m.loaddate) k
  ON f.fahrzeug_hashkey = k.vehicle_hk
WHERE (k.md5_hash IS NULL OR f.md5hash <> k.md5_hash)
  AND f.loaddate = <date to load>;
EXERCISE: LOAD SAT TABLE
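The heart of the SAT load above is the hash comparison: a row is inserted only if its MD5 over the descriptive columns differs from the hash of the current satellite row. A sketch with invented rows:

```python
import hashlib

# SAT delta check sketch: MD5 over descriptive columns vs. current SAT row.
def row_hash(*cols) -> str:
    return hashlib.md5("|".join(cols).encode()).hexdigest()

# Hash of the current/latest SAT row per hub key (hypothetical data)
current_sat = {"hk1": row_hash("CODE_A", "old comment")}

staging = [("hk1", "CODE_A", "old comment"),  # unchanged -> skip
           ("hk1", "CODE_A", "new comment"),  # changed   -> insert
           ("hk2", "CODE_B", "first load")]   # new key   -> insert

# Equivalent of "k.md5_hash IS NULL OR f.md5hash <> k.md5_hash"
to_insert = [r for r in staging
             if current_sat.get(r[0]) != row_hash(*r[1:])]
print(len(to_insert))  # 2
```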
LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
Data Warehouse (Backend → Frontend)
Internal data sources (OLTP systems) and external data sources feed the warehouse layers:
• Staging Layer (Input Layer)
• Integration Layer (Cleansing Layer)
• Core Warehouse Layer (Storage Layer)
• Aggregation Layer
• Mart Layer (Output Layer, Reporting Layer)
Cross-cutting components: Metadata Management, Security, DWH Manager incl. Monitor
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
THANK YOU