Introduction to ETL process Omid Vahdaty

Post on 16-Apr-2017

Page 1: Introduction to ETL process

Introduction to ETL process

Omid Vahdaty

Page 2: Introduction to ETL process

Assuming

●ETL = Extract, Transform, Load

●SQL knowledge

●DW concepts

Page 3: Introduction to ETL process

Concepts

●Dimensions

●Facts

●Aggregate facts

●Data mart

Page 4: Introduction to ETL process

BI vs ETL?

●ETL moves data from DB to DB

○ Tools: Talend, Informatica, SAP BODS, Oracle Data Integrator, Microsoft SSIS

●BI is

○ Ad hoc queries

○ Dashboarding

○ Tools: SAP BO, IBM Cognos, Jaspersoft, Tableau, Oracle BI.

Page 5: Introduction to ETL process

ETL

●Extract data from DB via jobs.

●Transform -

○ Change the format of data before loading.

○ Cleaning the data

○ Remove bad data or fix it.

○ Data integrity

●Load - simply load the data.
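The three steps above can be sketched in a few lines. This is only an illustration, not the flow of any specific ETL tool: the `customers` and `dim_customer` tables, their columns, and the DD/MM/YYYY source date format are all assumptions made for the example, and in-memory SQLite stands in for the real source and target databases.

```python
import sqlite3
from datetime import datetime

# Hypothetical source database with some dirty rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT, signup TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, " alice ", "16/04/2017"), (2, None, "17/04/2017"), (3, "Bob", "not a date")],
)

# Hypothetical target table with stricter constraints.
target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, signup TEXT NOT NULL)"
)


def extract(conn):
    """Extract: read the rows from the source table."""
    return conn.execute("SELECT id, name, signup FROM customers").fetchall()


def transform(rows):
    """Transform: clean names, change the date format, drop rows that cannot be fixed."""
    clean = []
    for row_id, name, signup in rows:
        if name is None:
            continue  # bad data: missing name
        try:
            iso_date = datetime.strptime(signup, "%d/%m/%Y").date().isoformat()
        except ValueError:
            continue  # bad data: unparseable date
        clean.append((row_id, name.strip().title(), iso_date))
    return clean


def load(conn, rows):
    """Load: simply insert the transformed rows."""
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()


load(target, transform(extract(source)))
loaded = target.execute("SELECT id, name, signup FROM dim_customer").fetchall()
```

Here bad rows are dropped silently to keep the sketch short; a real job would more likely write them to a reject table so they can be fixed.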

Page 6: Introduction to ETL process

ETL Tool layers

1.Staging - where extracted data is saved

2.Integration - where the data is processed and loaded

3.Access - where the data will be queried.

Page 7: Introduction to ETL process

ETL tasks

● Understand the data to be used for reporting

● Review the Data Model

● Source to target mapping

● Data checks on source data

● Packages and schema validation

● Data verification in the target system

● Verification of data transformation calculations and aggregation rules

● Sample data comparison between the source and the target system

● Data integrity and quality checks in the target system

● Performance testing on data

Page 8: Introduction to ETL process

ETL testing

Validation of data movement from the source to the target system.

Verification of data count in the source and the target system.

Verifying data extraction and transformation as per requirements and expectations.

Verifying if table relations – joins and keys – are preserved during the transformation.
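As a rough sketch of these checks, the queries below compare row counts, look for source rows that are missing or changed in the target, and find rows whose key no longer joins. The table names (`src_orders`, `tgt_orders`, `tgt_customers`) and the sample data are hypothetical, and both systems are simulated in one in-memory SQLite database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE src_orders (order_id INTEGER, cust_id INTEGER, amount REAL);
CREATE TABLE tgt_customers (cust_id INTEGER PRIMARY KEY);
CREATE TABLE tgt_orders (order_id INTEGER, cust_id INTEGER, amount REAL);
INSERT INTO src_orders VALUES (1, 10, 5.0), (2, 10, 7.5), (3, 11, 1.0);
INSERT INTO tgt_customers VALUES (10);
INSERT INTO tgt_orders VALUES (1, 10, 5.0), (2, 10, 7.5), (3, 11, 1.0);
""")


def row_count(table):
    # Table names are fixed strings from this sketch, never user input.
    return db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Data count in the source and the target system.
counts_match = row_count("src_orders") == row_count("tgt_orders")

# Data movement: source rows that are missing or altered in the target.
missing_in_target = db.execute(
    "SELECT * FROM src_orders EXCEPT SELECT * FROM tgt_orders"
).fetchall()

# Table relations: target orders whose customer key has no matching customer row.
orphans = db.execute("""
    SELECT o.order_id
    FROM tgt_orders o
    LEFT JOIN tgt_customers c ON c.cust_id = o.cust_id
    WHERE c.cust_id IS NULL
""").fetchall()
```

In this sample the counts match and no rows are missing, yet order 3 is an orphan, which shows why a count check alone is not enough.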

Page 9: Introduction to ETL process

Database testing

Verifying if primary and foreign keys are maintained.

Verifying if the columns in a table have valid data values.

Verifying data accuracy in columns. Example − Number of months column shouldn’t have a value greater than 12.

Verifying missing data in columns. Check if there are null columns which actually should have a valid value.
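The column checks above could look roughly like this. The `subscriptions` table and its columns are invented for the example; the month rule is the one from the slide (no value greater than 12).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE subscriptions (id INTEGER PRIMARY KEY, email TEXT, months INTEGER);
INSERT INTO subscriptions VALUES
    (1, 'a@example.com', 6),
    (2, NULL, 12),
    (3, 'c@example.com', 14);
""")

# Data accuracy: the number-of-months column should not exceed 12.
bad_months = [
    row[0]
    for row in db.execute("SELECT id FROM subscriptions WHERE months > 12 ORDER BY id")
]

# Missing data: a column that should always hold a value is NULL.
missing_email = [
    row[0]
    for row in db.execute("SELECT id FROM subscriptions WHERE email IS NULL ORDER BY id")
]
```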

Page 10: Introduction to ETL process

ETL testing categories

●Source to target

○ Count testing

○ Data validation testing (duplicates, data integrity)

○ Data transformation

○ Constraint testing (null, unique, keys, ranges)

●Change/delta testing

●End report testing

Page 11: Introduction to ETL process

ETL Challenges

●Data loss during ETL

●Incorrect, incomplete or duplicate data.

●The DW system contains historical data, so the data volume is too large and too complex to run ETL testing in the target system.

●Performance

●Checking critical columns

●Supporting date/time formats and time zone conversion

●Supported text encodings

●Ignoring headers in CSV

●Incorrect column count due to separator characters inside text fields
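A few of the file-level challenges (text encoding, CSV headers, separators inside text fields, time zones) can be illustrated with the Python standard library. The sample file content below is made up; the point is only to contrast naive splitting with a real CSV parser.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical file content: a header line, then one row whose text field contains a comma.
raw = 'id,comment,created\n1,"late, but delivered",2017-04-16T10:00:00+03:00\n'.encode("utf-8")

# Text encoding: decode explicitly instead of relying on the platform default.
text = raw.decode("utf-8")

# Naive splitting miscounts the columns because of the separator inside the quoted field.
naive_columns = len(text.splitlines()[1].split(","))

# csv.DictReader consumes the header line and respects quoted fields.
rows = list(csv.DictReader(io.StringIO(text)))
parsed_columns = len(rows[0])

# Time zone conversion: normalise the timestamp to UTC before loading.
created_utc = datetime.fromisoformat(rows[0]["created"]).astimezone(timezone.utc)
```

The naive split sees four columns where the file has three, which is the "incorrect column count" problem from the list above.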

Page 12: Introduction to ETL process

Extract validation

● Count check

● Reconcile records with the source data

● Data type check

● Ensure no spam data is loaded

● Remove duplicate data

● Check all the keys are in place
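A minimal sketch of the data type, duplicate and key checks on freshly extracted data, assuming a hypothetical untyped `staging` table. It relies on SQLite's `typeof()` function, so the exact query would differ on other databases.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Untyped staging columns keep whatever the extract delivered.
CREATE TABLE staging (cust_id, amount);
INSERT INTO staging VALUES (1, 10.5), (1, 10.5), (2, 'N/A'), (NULL, 3.0);
""")

# Data type check: amounts that did not arrive as numbers.
wrong_type = db.execute(
    "SELECT cust_id, amount FROM staging WHERE typeof(amount) NOT IN ('integer', 'real')"
).fetchall()

# Key check: rows that arrived without a key.
missing_keys = db.execute(
    "SELECT COUNT(*) FROM staging WHERE cust_id IS NULL"
).fetchone()[0]

# Remove duplicates among the rows that passed the checks above.
deduplicated = db.execute("""
    SELECT DISTINCT cust_id, amount
    FROM staging
    WHERE cust_id IS NOT NULL AND typeof(amount) IN ('integer', 'real')
""").fetchall()
```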

Page 13: Introduction to ETL process

Transform validation

● Data threshold validation check, for example, age value shouldn’t be more than 100.

● Record count check before and after the transformation logic is applied.

● Data flow validation from the staging area to the intermediate tables.

● Surrogate key check.
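The threshold, record count and surrogate key checks can be sketched as below. The row layout, the `customer_sk` column name and the policy of routing out-of-range rows to a reject list are assumptions for illustration.

```python
from itertools import count

# Hypothetical staged rows waiting to be transformed.
staged = [
    {"natural_key": "C-001", "age": 34},
    {"natural_key": "C-002", "age": 130},
    {"natural_key": "C-003", "age": 58},
]

MAX_AGE = 100          # threshold from the slide
next_key = count(1)    # simple surrogate key generator


def transform(rows):
    """Assign surrogate keys to valid rows and set aside rows over the threshold."""
    accepted, rejected = [], []
    for row in rows:
        if row["age"] > MAX_AGE:
            rejected.append(row)
        else:
            accepted.append({**row, "customer_sk": next(next_key)})
    return accepted, rejected


accepted, rejected = transform(staged)

# Record count check: every input row is either accepted or rejected, none lost.
count_reconciles = len(staged) == len(accepted) + len(rejected)

# Surrogate key check: keys are present and unique.
surrogate_keys = [row["customer_sk"] for row in accepted]
keys_valid = None not in surrogate_keys and len(set(surrogate_keys)) == len(surrogate_keys)
```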

Page 14: Introduction to ETL process

Load verification

Record count check from the intermediate table to the target system.

Ensure the key field data is not missing or Null.

Check if the aggregate values and calculated measures are loaded in the fact tables.

Check modeling views based on the target tables.

Check if CDC (change data capture) has been applied on the incremental load table.

Data check in the dimension table and the history table.

Check that the BI reports based on the loaded fact and dimension tables match the expected results.
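One way to check aggregate values in a fact table is to recompute them from the intermediate data and compare. The tables `intermediate_sales` and `fact_daily_sales` and their contents are hypothetical; the second day is deliberately loaded with a wrong total.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE intermediate_sales (day TEXT, amount REAL);
CREATE TABLE fact_daily_sales (day TEXT, total_amount REAL);
INSERT INTO intermediate_sales VALUES
    ('2017-04-15', 10.0), ('2017-04-15', 5.0), ('2017-04-16', 7.0);
INSERT INTO fact_daily_sales VALUES ('2017-04-15', 15.0), ('2017-04-16', 9.0);
""")

# Aggregate check: loaded totals that differ from totals recomputed from the intermediate table.
mismatches = db.execute("""
    SELECT f.day, f.total_amount, s.expected
    FROM fact_daily_sales f
    JOIN (SELECT day, SUM(amount) AS expected
          FROM intermediate_sales
          GROUP BY day) s ON s.day = f.day
    WHERE f.total_amount <> s.expected
    ORDER BY f.day
""").fetchall()

# Key field check: the key column must not be NULL.
null_keys = db.execute(
    "SELECT COUNT(*) FROM fact_daily_sales WHERE day IS NULL"
).fetchone()[0]
```

With real monetary data an exact `<>` comparison on floating-point totals may be too strict; comparing within a small tolerance or using a decimal type is a common alternative.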

Page 15: Introduction to ETL process

Data duplication validation

●Example:

SELECT Cust_Id, Cust_NAME, Quantity, COUNT(*)
FROM Customer
GROUP BY Cust_Id, Cust_NAME, Quantity
HAVING COUNT(*) > 1;

●Reasons for duplicate data:

○ If no primary key is defined, duplicate rows may be loaded.

○ Due to incorrect mapping or environmental issues.

○ Manual errors while transferring data from the source to the target system.

Page 16: Introduction to ETL process

Data Integrity testing

● number check

● date check

● null check

● precision check

● invalid characters

● incorrect upper/lower case order
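Each of these checks can be written as a small predicate. The rules below (ISO dates, two decimal places, a fixed set of allowed characters, title-cased names) are example rules chosen for the sketch, not rules taken from the slides; real rules come from the source-to-target mapping.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed whitelist of characters allowed in a name field.
INVALID_CHARS = re.compile(r"[^A-Za-z0-9 .,'-]")


def is_number(value):
    """Number check: the value parses as a finite decimal number."""
    try:
        return Decimal(value).is_finite()
    except (InvalidOperation, TypeError, ValueError):
        return False


def is_date(value, fmt="%Y-%m-%d"):
    """Date check: the value matches the expected date format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False


def is_null(value):
    """Null check: None or a blank string counts as missing."""
    return value is None or str(value).strip() == ""


def has_precision(value, max_decimals=2):
    """Precision check: a valid number with at most max_decimals decimal places."""
    return is_number(value) and Decimal(value).as_tuple().exponent >= -max_decimals


def has_invalid_chars(value):
    """Invalid characters: anything outside the assumed whitelist."""
    return bool(INVALID_CHARS.search(value))


def has_expected_case(value):
    """Case check: assumes names are expected to be title-cased."""
    return value == value.title()
```

For example, `has_precision("12.345")` fails the two-decimal rule and `has_expected_case("jOHN")` flags a case problem.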

Page 17: Introduction to ETL process

Detailed use cases for testing: https://www.tutorialspoint.com/etl_testing/etl_testing_scenarios.htm

Page 18: Introduction to ETL process

Best practices

●Analyze data

●Fix bad data in the source

●Find a compatible ETL tool

●Monitor ETL jobs

●Apply incremental ETL techniques when a timestamp is available.
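A common way to do incremental ETL with a timestamp is to keep a "watermark" of the last loaded value and only extract newer rows on each run. The sketch below assumes hypothetical tables (`source_events`, `target_events`, `etl_watermark`) and ISO-formatted text timestamps, which sort correctly as strings.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE source_events (id INTEGER PRIMARY KEY, updated_at TEXT);
CREATE TABLE target_events (id INTEGER PRIMARY KEY, updated_at TEXT);
CREATE TABLE etl_watermark (last_loaded TEXT);
INSERT INTO etl_watermark VALUES ('1970-01-01 00:00:00');
INSERT INTO source_events VALUES
    (1, '2017-04-16 10:00:00'),
    (2, '2017-04-16 10:05:00');
""")


def incremental_load():
    """Load only the rows newer than the stored watermark, then advance the watermark."""
    (watermark,) = db.execute("SELECT last_loaded FROM etl_watermark").fetchone()
    new_rows = db.execute(
        "SELECT id, updated_at FROM source_events WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    if new_rows:
        db.executemany("INSERT OR REPLACE INTO target_events VALUES (?, ?)", new_rows)
        db.execute("UPDATE etl_watermark SET last_loaded = ?", (new_rows[-1][1],))
        db.commit()
    return len(new_rows)


first_run = incremental_load()   # picks up the two existing rows
db.execute("INSERT INTO source_events VALUES (3, '2017-04-16 10:10:00')")
second_run = incremental_load()  # picks up only the new row
third_run = incremental_load()   # nothing new to load
```

A strict `>` comparison can miss rows that share the watermark timestamp but arrive later, so this approach assumes timestamps are unique or that a small overlap window is re-read.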

Page 20: Introduction to ETL process

Courses & books

●http://www.robertomarchetto.com/talend_data_integration_free_book

●Basic Time series ETL by Omid: https://docs.google.com/document/d/1KoFMeFtxXDGiZIswcGS1o8zmp2ZlPGbX0yj6ceQ_zlU/edit?usp=sharing

Page 21: Introduction to ETL process

Exercise: Original table, create, drop, and add new data.

/*
drop table t;
create table t (
    i int IDENTITY(1,1) NOT NULL,
    d datetime
);
*/

-- assuming unique date values in t, not null

insert into t (d) values (getdate());

SELECT * from t where d>=DATEADD(minute, -1, GETDATE());

Page 22: Introduction to ETL process

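Exercise: Staging table

-- extract: copy the rows added in the last minute into staging
insert into t_staging (i, d) select i, d from t where d >= DATEADD(minute, -1, GETDATE());

-- load: move the distinct rows from staging into the presentation table
insert into t_presentation (d, i) select distinct d, i from t_staging order by d desc;

-- clear staging for the next run
truncate table t_staging;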

Page 23: Introduction to ETL process

Exercise: presentation

select count(*) from t_presentation;
select count(*) from t;
select * from t_presentation order by d desc;

Page 24: Introduction to ETL process

Talend

Page 27: Introduction to ETL process

Talend sources

Add MSSQL JDBC: https://www.talendforge.org/forum/viewtopic.php?id=54068

How to connect 2 components: https://www.talendforge.org/forum/viewtopic.php?id=6493

How to create a loop of a job (FYI: right-click on the project name, create project. Under Job design, far left, upper corner): https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide62EN/tLoop

Running in parallel: https://www.talendbyexample.com/talend-job-parallelization-reference.html

Running several queries for ETL, such as insert into and truncate: http://www.vikramtakkar.com/2013/05/example-to-execute-multiple-sql-queries.html