TRANSCRIPT
A company of Daimler AG
LECTURE @DHBW: DATA WAREHOUSE
PART XX: DATA ENGINEERING
ANDREAS BUCKENHOFER, DAIMLER TSS
ABOUT ME
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2
Andreas Buckenhofer
Senior DB, [email protected]
Since 2009 at Daimler TSS / Department: Big Data / Business Unit: Analytics
ANDREAS BUCKENHOFER, DAIMLER TSS GMBH
Data Warehouse / DHBW | Daimler TSS | 3
“Forming good abstractions and avoiding complexity is an essential part of a successful data architecture”
Data has always been my main focus during my long-time occupation in the area of data integration. I work for Daimler TSS as a Database Professional and Data Architect with over 20 years of experience in Data Warehouse projects. I have been working with Hadoop and NoSQL since 2013. I keep my knowledge up to date, and I learn new things, experiment, and program every day.
I share my knowledge in internal presentations or as a speaker at international conferences. I'm regularly giving a full lecture on Data Warehousing and a seminar on modern data architectures at Baden-Wuerttemberg Cooperative State University DHBW. I also gained international experience through a two-year project in Greater London and several business trips to Asia.
I'm responsible for In-Memory DB Computing at the independent German Oracle User Group (DOAG) and was honored by Oracle as an ACE Associate. I hold current certifications such as "Certified Data Vault 2.0 Practitioner (CDVP2)", "Big Data Architect", "Oracle Database 12c Administrator Certified Professional", "IBM InfoSphere Change Data Capture Technical Professional", etc.
Contact/Connect
As a 100% Daimler subsidiary, we give 100 percent, always and never less. We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future. Our objective: we make Daimler the most innovative and digital mobility company.
NOT JUST AVERAGE: OUTSTANDING.
Daimler TSS
INTERNAL IT PARTNER FOR DAIMLER
+ Holistic solutions according to the Daimler guidelines
+ IT strategy
+ Security
+ Architecture
+ Developing and securing know-how
+ TSS is a partner who can be trusted with sensitive data
As a subsidiary: maximum added value for Daimler
+ Market closeness
+ Independence
+ Flexibility (short decision-making processes, ability to react quickly)
Daimler TSS
LOCATIONS
Daimler TSS China
Hub Beijing
10 employees
Daimler TSS Malaysia
Hub Kuala Lumpur
42 employees
Daimler TSS India
Hub Bangalore
22 employees
Daimler TSS Germany
7 locations
1000 employees*
Ulm (Headquarters)
Stuttgart
Berlin
Karlsruhe
* as of August 2017
After the end of this lecture you will be able to:
• understand the concepts behind ETL
WHAT YOU WILL LEARN TODAY
DATA SCIENTIST: SEXIEST JOB OF THE 21ST CENTURY?
Source: https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientist-article
DATA SCIENTIST VS DATA ENGINEERING VS SOFTWARE ENGINEERING
Source: 2018 Enterprise Almanac
THE MACHINE LEARNING PIPELINE, AKA "DATA ENGINEERING", AKA "DATA INTEGRATION"
LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
Data Warehouse (Backend → Frontend)
Internal data sources (OLTP systems) and external data sources feed the warehouse layers:
• Staging Layer (Input Layer)
• Integration Layer (Cleansing Layer)
• Core Warehouse Layer (Storage Layer)
• Aggregation Layer
• Mart Layer (Output Layer, Reporting Layer)
Cross-cutting components: Metadata Management, Security, DWH Manager incl. Monitor
Extract – Transform – Load
Other term: data integration (a better, more neutral term)
ETL PROCESS
• Capture and copy data from source systems (e.g. operational systems)
• Many different types of sources:
  • Relational and hierarchical DBMSs
  • Flat files
  • Other internal/external sources
TASKS OF THE ETL PROCESS - EXTRACT
• Filter data
• Integrate data
• Check and cleanse data
TASKS OF THE ETL PROCESS - TRANSFORM
• Original meaning: Fast load into staging area
• General meaning: Loading data into staging area or another layer
TASKS OF THE ETL PROCESS - LOAD
ETL is often used for data integration in general (covering both ETL and ELT)
But: when ELT is mentioned explicitly, it is being differentiated from ETL
ETL VS ELT
Data flow:
• ETL: Source DB → ETL Server (performs the transformations) → Target DB
• ELT: Source DB → Target DB; the ELT server generates and coordinates the transformations, which run inside the DB
ETL VS ELT
ETL:
• Data is transferred to the ETL server and transferred back to the DB; high network bandwidth required
• Transformations are performed in the ETL server
• Proprietary code is executed in the ETL server
• Typically used for source-to-target transfer, compute-intensive transformations, small amounts of data

ELT:
• Data remains in the DB except for cross-database loads (e.g. source to target)
• Transformations are performed (in the source or) in the target
• Generated code, e.g. SQL, PL/SQL, SQLT
• Typically used for high amounts of data
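The ELT column above can be sketched in a few lines: the raw data is loaded first, and the transformation is SQL executed by the database engine itself, not by a separate server. This is a minimal illustration using Python's sqlite3 with made-up table names.

```python
import sqlite3

# Minimal ELT sketch (hypothetical tables): load raw first,
# then transform with SQL running inside the target database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stg_sales (id INTEGER, amount TEXT)")
con.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

# L: load the raw, untransformed rows into the staging table
con.executemany("INSERT INTO stg_sales VALUES (?, ?)",
                [(1, "10.5"), (2, "20.0")])

# T: the transformation is generated SQL executed by the DB engine
con.execute("INSERT INTO sales SELECT id, CAST(amount AS REAL) FROM stg_sales")

total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 30.5
```

In an ETL setup, the CAST step would instead run in the ETL server before the data reaches the target.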
ETL/ELT TOOL VS MANUAL ETL/ELT
ETL tool (Informatica, Talend, Oracle ODI, etc.):
• Separate license
• Workflow, error handling, and restart/recovery functionality included
• Impact analysis and where-used (lineage) functionality available
• Faster development, easier maintenance
• Additional (tool) know-how required

Manual ETL (SQL, PL/SQL, SQLT, etc.):
• No additional license
• Workflow, error handling, and restart/recovery functionality must be implemented manually
• Impact analysis and where-used (lineage) functionality difficult
• Slower development, more difficult maintenance
• Know-how often available
ETL/ELT TOOL VS MANUAL ETL/ELT
Typical services of an ETL tool:
• Extract services: connectors, sorter
• Load services: connector, sorter, bulk loader
• Data profiling services: source analysis
• Data quality services: data cleansing
• Data transformation and integration services: data mapping, business rules, slowly changing dimensions, datatype conversion, lookups
• Operations management services: scheduler, control, repository management, job monitoring, auditing, error handling, security
MAPPING - INFORMATICA
[Screenshot: Informatica mapping from Source to Target with Filter and Lookup transformations]
MAPPING WITH TRANSFORMATIONS - INFORMATICA
[Screenshot: Informatica mapping with Sorter, Aggregator, and Union transformations]
• Specification between source and target columns
• Source tables + columns
• Target table + columns
• Join rules
• Filter criteria
• Transformation rules
DATA MAPPING
WORKFLOW - INFORMATICA
[Screenshot: Informatica workflow with a decision & coordination step and a session containing the mapping]
JOB MONITORING - INFORMATICA
Extracts from source systems
• Initial extract for setting up the data warehouse: initial load
• Periodical extracts for adding new/changed information to the data warehouse: incremental load
Question: how can we determine what is new or what has changed in the source systems?
→ This is the task of "monitoring"
MONITORING (DATA CHANGE DETECTION)
Discovery of all changes vs. determining only the net effect at extract/load time
• Example: an attribute value can be changed in two ways:
  • by one update operation
  • by one delete and one insert operation
• The net effect of both is the same
• However, history information is lost if only the net effect is recorded
MONITORING: NET EFFECT OF CHANGES
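The net-effect argument above can be made concrete with a single hypothetical row: two different operation sequences end in the same state, so a comparison at extract time cannot tell them apart.

```python
# Sketch of the net-effect idea with one hypothetical row.
row = {"id": 1, "city": "Ulm"}

# Way 1: one UPDATE operation
updated = dict(row, city="Berlin")

# Way 2: one DELETE followed by one INSERT
deleted_then_inserted = None                         # delete
deleted_then_inserted = {"id": 1, "city": "Berlin"}  # insert

# A snapshot/extract-time comparison only sees the identical net effect;
# HOW the row changed (update vs. delete+insert) is lost.
assert updated == deleted_then_inserted
```

A log reader, in contrast, would have recorded both intermediate operations.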
Which techniques can be used to identify changes in a source system (RDBMS)?
• E.g. in an OLTP system:
  • new products are inserted
  • customer addresses change
  • a product is deleted because it is out of stock
How would you identify such changes? List advantages / disadvantages of possible solutions.
Think about making changes in the source system, but also about solutions without any change in the source system.
EXERCISE MONITORING
These depend on the characteristics of the data sources. The following techniques are based on modern relational DBMSs.
Types of techniques:
• Based on the DBMS:
  • Trigger-based
  • Log-based discovery
  • Replication techniques
• Controlled by the application:
  • Timestamp-based discovery
  • Snapshot-based discovery
MONITORING TECHNIQUES
Active monitoring mechanisms based on (database) triggers
• Example: if a new record is inserted into the sales transaction table, then insert the transaction id and a timestamp into a change table
Advantage:
• Triggers do not require changes to the operational applications
Disadvantages:
• Performance impact on operational systems if triggers are used extensively
• Triggers have to be implemented for every table in the source systems
TRIGGER-BASED
Sample trigger code, Oracle:
CREATE [OR REPLACE] TRIGGER <trigger_name>
{BEFORE|AFTER} {INSERT|DELETE|UPDATE}
ON <table_name>
[REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]]
[FOR EACH ROW [WHEN (<trigger_condition>)]]
<trigger_body>
A trigger is created for each source table in the OLTP DB and stores insert/update/delete changes in a "log/journal table"
• the trigger body contains insert statements into the log/journal table
TRIGGER-BASED
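The Oracle syntax skeleton above can be tried out with SQLite, which supports a very similar trigger mechanism. This is an illustrative sketch with made-up table names: an AFTER INSERT trigger fills the log/journal table, exactly as described on the slide.

```python
import sqlite3

# Trigger-based change capture sketch (hypothetical tables, SQLite syntax).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (tx_id INTEGER PRIMARY KEY, amount REAL)")
con.execute("CREATE TABLE sales_journal (tx_id INTEGER, op TEXT, ts TEXT)")

# The trigger body inserts into the log/journal table
con.execute("""
CREATE TRIGGER trg_sales_ins AFTER INSERT ON sales
BEGIN
  INSERT INTO sales_journal VALUES (NEW.tx_id, 'I', datetime('now'));
END""")

con.execute("INSERT INTO sales VALUES (1, 99.0)")
journal = con.execute("SELECT tx_id, op FROM sales_journal").fetchall()
print(journal)  # [(1, 'I')]
```

Corresponding UPDATE and DELETE triggers would be needed per table, which illustrates the implementation-effort disadvantage from the previous slide.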
Log-based discovery
Also often referenced as CDC (Change Data Capture)
Usage of database transaction logs to determine changes
• DBMSs write transaction logs in order to be able to undo partially executed transactions
• This information can be used to determine all changes
• Log reader identifies insert, update, delete, truncates and writes the changes as inserts into staging layer
Transaction Log files can be transferred to other systems to avoid additional load on source systems
LOG-BASED
LOG-BASED (SAMPLE PRODUCT ARCHITECTURE IIDR)
[Architecture diagram: the source OLTP DB writes transaction logs; an IIDR replication engine reads them on the source datastore and ships changes to an IIDR replication engine on the DWH datastore; the DWH DB contains staging, core, and mart layers feeding standard and ad-hoc reports in the frontend]
Replication techniques
Data replication
• Target tables are not necessarily on the local system
• Typically uses transaction logs
• A log reader identifies inserts, updates, deletes, truncates and writes the changes into replicated tables (an insert remains an insert, an update remains an update, etc.)
• Useful for 1:1 copies (e.g. ODS, Operational Data Store), but it is still a challenge to detect changes for loading the data mart
REPLICATION-BASED
REPLICATION-BASED (SAMPLE PRODUCT ARCHITECTURE IIDR)
[Architecture diagram: the source OLTP DB writes transaction logs; an IIDR replication engine reads them on the source datastore and ships changes to an IIDR replication engine on the DWH datastore; the DWH DB contains staging, core, and mart layers feeding standard and ad-hoc reports in the frontend]
Timestamp-based discovery
• Every data item in a table is associated with timestamp information about its validity period
• Changed data can be determined from this timestamp information
TIMESTAMP-BASED
Sample customer table in OLTP
• Each table gets a change timestamp column
• The delta process reads the latest data only (e.g. ChangeTimestamp >= <yesterday>)
• Problem: it is not possible to identify deleted rows
TIMESTAMP-BASED
CustomerID | Name | Department | Change Timestamp
1 | Miller | DWH | 15.01.2015 17:00:01
2 | Powell | DB | 22.03.2016 08:30:22
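The delta read and its blind spot for deletes can be shown in a few lines; this sketch uses sqlite3 with the sample customers above (ISO timestamps assumed so string comparison works).

```python
import sqlite3

# Timestamp-based delta extract sketch (hypothetical table/column names).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER, name TEXT, change_ts TEXT)")
con.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, "Miller", "2015-01-15 17:00:01"),
    (2, "Powell", "2016-03-22 08:30:22"),
])

# Delta process: read only rows changed since the last load
last_load = "2016-01-01 00:00:00"
delta = con.execute(
    "SELECT id, name FROM customer WHERE change_ts >= ?", (last_load,)
).fetchall()
print(delta)  # [(2, 'Powell')]

# The problem from the slide: a deleted row leaves no row behind with a
# newer timestamp, so the next delta extract simply cannot see the delete.
con.execute("DELETE FROM customer WHERE id = 1")
delta2 = con.execute(
    "SELECT id, name FROM customer WHERE change_ts >= ?", (last_load,)
).fetchall()
print(delta2)  # still [(2, 'Powell')]
```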
Data comparison
Comparison of snapshots of the operational data at different points in time
• Compute the difference between the two latest snapshots
• E.g. unload all data from a table into a file and diff the newest file content against the previous file content
Can be very complex
Sometimes the only possibility, for instance for legacy applications
High performance impact on the source
SNAPSHOT-BASED
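The snapshot comparison can be sketched as a diff of two full extracts keyed by primary key; the rows here are invented, and in practice each dictionary would come from an unloaded file.

```python
# Snapshot-based discovery sketch: diff two extracts keyed by PK.
old = {1: ("Miller", "Ulm"), 2: ("Powell", "Berlin"), 3: ("Bush", "Kiel")}
new = {1: ("Miller", "Stuttgart"), 2: ("Powell", "Berlin"), 4: ("Martin", "Ulm")}

inserts = {k: new[k] for k in new.keys() - old.keys()}
deletes = {k: old[k] for k in old.keys() - new.keys()}
updates = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}

print(inserts)  # {4: ('Martin', 'Ulm')}
print(deletes)  # {3: ('Bush', 'Kiel')}
print(updates)  # {1: ('Miller', 'Stuttgart')}
```

Unlike the timestamp technique, this does see deletes, but only at the cost of reading and comparing everything, which is where the high performance impact comes from.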
MONITORING TECHNIQUES COMPARISON
(Trigger-based / Replication techniques / Log-based discovery / Timestamp-based discovery / Snapshot-based discovery)
• Performance impact on source system: Medium / Low / Low / Medium / High
• Performance impact on target system: Low / Low / Low / Low / High
• Load on network: Low / Low / Low / Low / High
• Data loss with nologging operations: No / Yes / Yes / No / No
MONITORING TECHNIQUES COMPARISON
(Trigger-based / Replication techniques / Log-based discovery / Timestamp-based discovery / Snapshot-based discovery)
• Identify DELETE operations: Yes / Yes / Yes / No / Yes
• Identify ALL changes (changes between extractions): Yes / Yes / Yes / No / No
Direct Access
• Source writes data into target or
• Target reads data from source
• Security concerns
• High coupling / dependencies
DATA TRANSPORT – DIRECT ACCESS
Source → Target
File transfer (or other transport medium)
• csv, json, xml, binary, etc
• Transfer data by scp, rfts (reliable file transfer system), ESB (enterprise service bus), SOA (service oriented architecture), etc
• Often high amounts of data, therefore bulk transfer of compressed data most widely used
• Better decoupling of source and target
DATA TRANSPORT – FILE TRANSFER
Source → files → Target
Extraction intervals
• Periodically, in regular intervals
  • every day, week, etc.
• Instantly / continuously
  • every change is directly propagated into the data warehouse
  • "real-time data warehouse"
• Depends on the requirements on the timeliness of the data warehouse data
EXTRACTION INTERVALS
Triggered by a specific request
• Addition of a new product
• Query which involves more recent data
Triggered by specific events
• Number of changes in operational data exceeds threshold
EXTRACTION INTERVALS
DATA QUALITY
Source: https://twitter.com/markmadsen/status/1059579065164738560?s=21
• Profile existing data sources and extracted data
• Analyze data structure, content, and quality
• Find data relationships across systems
  • often badly documented or missing foreign keys
• Uncover data issues that can affect subsequent transformation steps
  • missing values
  • duplicates
  • inconsistencies
PREREQUISITE OF TRANSFORMATION: UNDERSTANDING THE DATA
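Two of the profiling checks above (missing values, duplicate keys) can be sketched in plain Python over a hypothetical extract; a real profiling tool would run many such checks per column.

```python
from collections import Counter

# Minimal profiling sketch over a hypothetical extracted table.
rows = [
    {"id": 1, "zip": None},
    {"id": 1, "zip": "98144"},
    {"id": 2, "zip": "00000"},
]

# Missing values per column
missing_zip = sum(1 for r in rows if r["zip"] is None)

# Duplicate candidate keys
id_counts = Counter(r["id"] for r in rows)
duplicate_ids = [k for k, n in id_counts.items() if n > 1]

print(missing_zip)    # 1
print(duplicate_ids)  # [1]
```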
DATA PYRAMID AND DATA QUALITY
Source: By Matthew.viel - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=49310779 LinkedIn 11/2017: https://www.linkedin.com/feed/update/urn:li:activity:6334062387355746304
DATA QUALITY ISSUES
CustomerNo | Name | Birthdate | Age | Gender | Zip code
1 | Miller, Tom | 33.01.2001 | 15 | M | NULL
1 | John Mayor | 15.01.2001 | 15 | M | 98144
2 | Mrs. Bush | 31.10.1988 | 22 | Q | 00000
3 | Martin | 31.10.1988 | 22 | M | 75890
Issues illustrated: PK / unique key violated (CustomerNo 1 appears twice), data not uniform (name formats), not valid (33.01.2001), inconsistent, wrong value, unknown / missing (NULL), FK violated
DATA QUALITY ISSUES AND POSSIBLE SOLUTIONS IN THE SOURCE RDBMS
Issue → Solution
• Wrong data, e.g. 31.02.2016 → proper data type definition
• Wrong values, e.g. number out of range → check constraint
• Missing values → NOT NULL constraint
• Violated references → FOREIGN KEY constraint
• Duplicates → PRIMARY or UNIQUE KEY constraint
• Inconsistent data → ACID transactions, business logic, additional checks
Correcting the data
• Automatically during ETL
  • e.g. the address of a customer, if a correct reference table exists
• Manually after ETL is finished
  • ETL stores bad data in error log tables or files
  • ETL flags bad data (e.g. as invalid)
DATA QUALITY ISSUES: WORKAROUNDS IN DWH
Correcting the data
• In the source systems
  • common master data management across all operational applications
  • dedicated systems are the "master" of e.g. customer data
• Correcting the data at the source is the best approach, but slow and often not feasible
DATA QUALITY ISSUES: CORRECT DATA IN THE SOURCE
• Column is null
  • Reject data
  • Use default values
• Missing values can represent
  • an unknown value, like the date of birth of a customer
  • a missing value, like engine_id for a car (logical not null constraint)
• Dimension tables can include some dummy values:
DATA QUALITY ISSUES: MISSING DATA
DimensionTable_X Description
-1 Unknown
-2 Missing
• Data is inaccurate, e.g. wrong date 32.12.2015 or wrong number 55U
  • Reject data
  • Replace with a value that represents "Invalid"
• Dimension tables can include some dummy values:
DATA QUALITY ISSUES: INVALID DATA
DimensionTable_X Description
-1 Unknown
-2 Missing
-3 Invalid
• Data has conflicts, e.g. wrong postal code 80995 Stuttgart
  • Reject data
  • Replace one of the values with a value that represents "Invalid" or with a corrected value. Which value to replace? Rules are necessary
DATA QUALITY ISSUES: CONFLICTING DATA
• Data is inconsistent, e.g. unlikely high price for a product
• Can be discovered by statistical and data mining methods
DATA QUALITY ISSUES: INCONSISTENT DATA
• Data is duplicated, e.g. "Martin Miller" vs "Miller, Martin" vs "M. Miller"
• Multiple representations for one entity
  • different keys
  • different encodings
• Duplicate detection can be very difficult / tricky
• Products are available, e.g. for addresses:
  • address duplicate detection
  • address validation (Kingstreet: does this address actually exist?)
  • address harmonization (Kingstr, Kingstreet, King Street, etc.)
• Standardize / harmonize data during the ETL flow: "unification"
DATA QUALITY ISSUES: DUPLICATES
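A tiny harmonization step before duplicate detection can be sketched as follows; the single rewrite rule here is an invented illustration, and real products apply large rule sets (abbreviations like "M. Miller" would still need extra logic).

```python
import re

# Naive name harmonization sketch: collapse different spellings of the
# same person to one key before duplicate detection (illustrative rule).
def normalize(name: str) -> str:
    name = name.strip()
    if "," in name:  # "Miller, Martin" -> "Martin Miller"
        last, first = [p.strip() for p in name.split(",", 1)]
        name = f"{first} {last}"
    return re.sub(r"\s+", " ", name).lower()

variants = ["Martin Miller", "Miller, Martin", "martin  miller"]
keys = {normalize(v) for v in variants}
print(keys)  # {'martin miller'} -- all three collapse to one entity
```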
• Unification of data types
  • character string → date: "20.01.2006" → 20.01.2006
  • character string → number: "12345" → 12345
• Unification of encodings
  • for instance for gender: F and M
  • lookup tables contain the mapping from old to new encodings
• Combination of different attributes into one attribute
  • day, month, year → date
TRANSFORM - UNIFICATION OF DATA
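The three unification steps above (type cast, encoding lookup, attribute merge) fit into one small sketch; the lookup table and record layout are invented for illustration.

```python
from datetime import date

# Unification sketch: type casts, encoding lookup table, attribute merge.
GENDER_LOOKUP = {"F": "FEMALE", "W": "FEMALE", "M": "MALE"}  # hypothetical codes

def unify(rec):
    return {
        # day, month, year -> one date attribute
        "birthdate": date(rec["year"], rec["month"], rec["day"]),
        # character string -> number
        "amount": int(rec["amount"]),
        # old encoding -> new encoding via lookup table
        "gender": GENDER_LOOKUP[rec["gender"]],
    }

out = unify({"day": 20, "month": 1, "year": 2006, "amount": "12345", "gender": "W"})
print(out["birthdate"], out["amount"], out["gender"])  # 2006-01-20 12345 FEMALE
```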
• Split of one attribute into two or more
  • Name → first name, last name ("Herr Prof. Dr. Hans M. vom und zum Stein")
  • Unification of names can become very challenging: "Herr Prof. Dr. Hans M. vom und zum Stein" or "Werner Martin" or "Mariae Gloria … Wilhelmine Huberta Gräfin von Schönburg-Glauchau"
  • Product name "Cola, 0.33 l" → product short name "Cola", size in liters 0.33
TRANSFORM - UNIFICATION OF DATA
• Unification of dates and timestamps
  • rules for representing incomplete date information, e.g. if only month and year are known
• Dates and timestamps with regard to one specific timezone
  • important for multi-national organizations
  • UTC: Coordinated Universal Time, without daylight saving
• What can happen at the switch to winter time if no UTC is used?
  • an update arrives at 02:15 in the staging layer (CDC / log-based monitor)
  • the clock is changed to winter time: -1h
  • an update of the same row arrives at 02:10 in the staging layer (CDC / log-based)
  • how can the batch load running the next night discover which update is the most recent one?
TRANSFORM - UNIFICATION OF DATA
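The clock-change scenario above can be replayed with fixed offsets (a sketch: CEST = UTC+2 before the switch, CET = UTC+1 after it, using the 2018 autumn switch as an assumed date): the local wall clock orders the two updates wrongly, UTC orders them correctly.

```python
from datetime import datetime, timezone, timedelta

CEST = timezone(timedelta(hours=2))  # summer time, before the switch
CET = timezone(timedelta(hours=1))   # winter time, after the switch

first = datetime(2018, 10, 28, 2, 15, tzinfo=CEST)   # arrives at 02:15
second = datetime(2018, 10, 28, 2, 10, tzinfo=CET)   # arrives "later" at 02:10

# By local wall clock alone, the later update looks older ...
assert second.replace(tzinfo=None) < first.replace(tzinfo=None)
# ... but in UTC the real order is preserved:
assert second.astimezone(timezone.utc) > first.astimezone(timezone.utc)
```

Storing UTC in the staging layer is what lets the batch load pick the truly most recent update.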
• Computation of derived values
  • Profit = sales price - purchase price. Without a clear definition, different interpretations are possible:
    • net or gross sales price?
    • net or gross purchase price?
• Aggregations
  • Revenue of the year computed from revenues of the day. Without a clear definition, different interpretations are possible:
    • calendar year?
    • fiscal year?
TRANSFORM - UNIFICATION OF DATA
• Efficient load operations are important
  • bulk load: single-row processing vs. set-based processing
• Online load
  • the data warehouse (especially the data mart) remains accessible
• Offline load
  • the data warehouse (especially the data mart) is offline
  • for updates that require the recomputation of a cube
  • offline load is often a tool limitation because the tool locks data structures; on the other hand, an offline load can be faster
LOAD
• Specific Bulk load operations provided by RDBMS, e.g. External tables in Oracle or LOAD command in DB2
• Single row vs set based processing
BULK PROCESSING
Single row processing:
    cursor curs = SELECT * FROM <source>;
    WHILE NOT EOF(curs) LOOP
      FETCH NEXT ROW INTO myRow;
      INSERT INTO <target> VALUES (myRow);
    END LOOP;
• Error handling easy
• Slow for high amounts of data
• More coding

Set based processing:
    INSERT INTO <target> SELECT * FROM <source>;
• All or nothing if there are errors
• Performs well for small and high amounts of data
• Less code = fewer errors
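Both pseudocode variants above can be run against SQLite; the table names are made up, and the set-based variant is a single statement executed entirely inside the engine.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source_t (id INTEGER)")
con.execute("CREATE TABLE target_t (id INTEGER)")
con.executemany("INSERT INTO source_t VALUES (?)", [(i,) for i in range(1000)])

# Single row processing: fetch a cursor and insert row by row
for (i,) in con.execute("SELECT id FROM source_t").fetchall():
    con.execute("INSERT INTO target_t VALUES (?)", (i,))

con.execute("DELETE FROM target_t")

# Set based processing: one statement, executed inside the DB engine
con.execute("INSERT INTO target_t SELECT id FROM source_t")

n = con.execute("SELECT COUNT(*) FROM target_t").fetchone()[0]
print(n)  # 1000
```

On real data volumes the per-row round trips of the first variant dominate, which is why bulk loads favor the set-based form.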
ETL-JOB PARALLELISM FOR LOADING DATA INTO CORE WAREHOUSE LAYER
[Diagram: Data Vault load in parallel waves (HUBs loaded, then LINKs and HUB-SATs loaded, then LINK-SATs loaded) vs. classical load with long chains of sequential jobs]
Integration of new jobs; time windows for loads, e.g. 00:00-06:00
Classical load:
• complex
• many dependencies
• many sequential jobs
Data Vault load:
• systematic / methodic
• few, well-defined dependencies
• massively parallel
EXAMPLE FOR DATA INTEGRATION IN DATA VAULT 2.0 ARCHITECTURE
Source: Hans Hultgren: Modeling the agile Data Warehouse with Data Vault, New Hamilton 2012, p. 224
[Diagram: ETL ("monitoring") extracts from the sources; ETL with hard rules only loads the Raw Data Vault; (E)T(L) with soft rules loads the Business Data Vault; ETL loads onward into the marts]
Draw a flow diagram showing how to load a HUB, LINK, and SAT table, and describe the SQL statements
EXERCISE: LOAD DATA VAULT TABLE
EXERCISE: LOAD HUB TABLE
Flow: source data exist → load distinct business keys → does the business key already exist in the HUB?
• yes → reject the data
• no → insert the row into the HUB (conflict if there is a PK hash key collision!) → data loaded into HUB
INSERT INTO core.hub_fahrzeug (vehicle_hk, fin, loaddate, recordsource)
SELECT DISTINCT f.fahrzeug_hashkey
     , f.fin_bk
     , f.loaddate
     , f.recordsource
FROM staging.fahrzeugdaten f
WHERE f.fin_bk NOT IN (SELECT fin FROM core.hub_fahrzeug)
  AND f.loaddate = <date to load>;
EXERCISE: LOAD HUB TABLE
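The HUB load pattern above can be replayed end to end with sqlite3; this sketch reuses the slide's (German) names but drops the schema prefixes and the loaddate filter for brevity.

```python
import sqlite3

# Runnable sketch of the HUB load: insert only distinct business keys
# that are not yet present in the hub (hypothetical, simplified tables).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging_fahrzeugdaten (fahrzeug_hashkey TEXT, fin_bk TEXT)")
con.execute("CREATE TABLE hub_fahrzeug (vehicle_hk TEXT PRIMARY KEY, fin TEXT)")

con.execute("INSERT INTO hub_fahrzeug VALUES ('hk1', 'FIN1')")  # already loaded
con.executemany("INSERT INTO staging_fahrzeugdaten VALUES (?, ?)",
                [("hk1", "FIN1"), ("hk2", "FIN2"), ("hk2", "FIN2")])

con.execute("""
INSERT INTO hub_fahrzeug (vehicle_hk, fin)
SELECT DISTINCT f.fahrzeug_hashkey, f.fin_bk
FROM staging_fahrzeugdaten f
WHERE f.fin_bk NOT IN (SELECT fin FROM hub_fahrzeug)""")

hub = con.execute("SELECT vehicle_hk, fin FROM hub_fahrzeug ORDER BY fin").fetchall()
print(hub)  # [('hk1', 'FIN1'), ('hk2', 'FIN2')]
```

The existing key is skipped and the duplicate staging row is collapsed by DISTINCT, so the hub stays insert-only and duplicate-free.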
EXERCISE: LOAD LINK TABLE
Flow: source data exist → load distinct business keys → does the hash key relationship already exist in the LINK?
• yes → reject the data
• no → insert the row into the LINK (conflict if there is a PK hash key collision!) → data loaded into LINK
INSERT INTO core.link_verbaut (verbaut_hk, motor_hk, vehicle_hk, loaddate, recordsource)
SELECT DISTINCT f.verbaut_hashkey
     , f.motor_hashkey
     , f.fahrzeug_hashkey
     , f.loaddate
     , f.recordsource
FROM staging.fahrzeugdaten f
WHERE (f.motor_hashkey, f.fahrzeug_hashkey) NOT IN
      (SELECT v.motor_hk, v.vehicle_hk FROM core.link_verbaut v)
  AND f.loaddate = <date to load>;
EXERCISE: LOAD LINK TABLE
EXERCISE: LOAD SAT TABLE
Flow: source data exist → load distinct source data; load the current/latest row from the SAT table → is the MD5 hash diff identical?
• yes → reject the data
• no → insert the row into the SAT → data loaded into SAT
INSERT INTO core.sat_fahrzeug_text (vehicle_hk, loaddate, recordsource, md5_hash, codeleiste, kommentar)
SELECT DISTINCT f.fahrzeug_hashkey
     , f.loaddate
     , f.recordsource
     , f.md5hash
     , f.codeleiste
     , f.kommentar
FROM staging.fahrzeugdaten f
LEFT OUTER JOIN (SELECT s.vehicle_hk, s.md5_hash
                 FROM core.sat_fahrzeug_text s
                 JOIN (SELECT i.vehicle_hk, MAX(i.loaddate) AS loaddate
                       FROM core.sat_fahrzeug_text i
                       GROUP BY i.vehicle_hk) m
                   ON s.vehicle_hk = m.vehicle_hk AND s.loaddate = m.loaddate) k
  ON f.fahrzeug_hashkey = k.vehicle_hk
WHERE (k.md5_hash IS NULL OR f.md5hash <> k.md5_hash)
  AND f.loaddate = <date to load>;
EXERCISE: LOAD SAT TABLE
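The heart of the SAT load above is the hash comparison: a row is inserted only if its MD5 over the descriptive columns differs from the hash of the current satellite row. A sketch with invented rows:

```python
import hashlib

# SAT delta check sketch: MD5 over descriptive columns vs. current SAT row.
def row_hash(*cols) -> str:
    return hashlib.md5("|".join(cols).encode()).hexdigest()

# Hash of the current/latest SAT row per hub key (hypothetical data)
current_sat = {"hk1": row_hash("CODE_A", "old comment")}

staging = [("hk1", "CODE_A", "old comment"),  # unchanged -> skip
           ("hk1", "CODE_A", "new comment"),  # changed   -> insert
           ("hk2", "CODE_B", "first load")]   # new key   -> insert

# Equivalent of "k.md5_hash IS NULL OR f.md5hash <> k.md5_hash"
to_insert = [r for r in staging
             if current_sat.get(r[0]) != row_hash(*r[1:])]
print(len(to_insert))  # 2
```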
LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
Data Warehouse (Backend → Frontend)
Internal data sources (OLTP systems) and external data sources feed the warehouse layers:
• Staging Layer (Input Layer)
• Integration Layer (Cleansing Layer)
• Core Warehouse Layer (Storage Layer)
• Aggregation Layer
• Mart Layer (Output Layer, Reporting Layer)
Cross-cutting components: Metadata Management, Security, DWH Manager incl. Monitor
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
THANK YOU