Unit 4 ETL(Extract Transform Load)

Upload: rosemary-louise-underwood

Post on 26-Dec-2015


Page 1: Unit 4 ETL(Extract Transform Load). ETL The Extract-Transform-Load (ETL) system is the foundation of the data warehouse. A properly designed ETL system


ETL

The Extract-Transform-Load (ETL) system is the foundation of the data warehouse.

A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions.

The ETL system makes or breaks the data warehouse. Although building the ETL system is a back-room activity that is not very visible to end users, it easily consumes 70 percent of the resources needed for implementation and maintenance of a typical data warehouse.

The ETL system:
1. Removes mistakes and corrects missing data
2. Provides documented measures of confidence in data
3. Captures the flow of transactional data for safekeeping
4. Adjusts data from multiple sources to be used together
5. Structures data to be usable by end-user tools

Data Flow in ETL

Data Quality

Introduction

• World of heterogeneity.
• Different technologies.
• Different platforms.
• Large amounts of data being generated every day in all sorts of organizations and enterprises.
• Problems with data.


Problems

• Data is duplicated, inconsistent, ambiguous, or incomplete.

• So there is a need to collect the data in one place and clean it up.

Why does data quality matter?

Good data is your most valuable asset, and bad data can seriously harm your business and credibility…

1. What have you missed?
2. When things go wrong.
3. Making confident decisions.


What is data quality?

• Data quality is a perception or an assessment of data’s fitness to serve its purpose in a given context.
• It is described by several dimensions:
• Correctness / Accuracy: Accuracy of data is the degree to which the captured data correctly describes the real-world entity.
• Consistency: This is about the single version of truth. Consistency means data throughout the enterprise should be in sync.

• Completeness: It is the extent to which the expected attributes of data are provided.
• Timeliness: Delivering the right data to the right person at the right time is important for business.
• Metadata: Data about data.


Maintenance of data quality

Data quality results from the process of going through the data and scrubbing it, standardizing it, and deduplicating records, as well as doing some data enrichment.
1. Maintain complete data.
2. Clean up your data by standardizing it using rules.
3. Use matching algorithms to detect duplicates, e.g., "ICS" and "Informatics Computer System".
4. Avoid entry of duplicate leads and contacts.
5. Merge existing duplicate records.
6. Use roles for security.

Inconsistent data before cleaning up (Invoices 1 to 4):

Bill no   CustomerName           SocialSecurityNumber
101       Mr. Aleck Stevenson    ADWPS10017
205       Mr. S Aleck            ADWPS10017
314       Mr. Stevenson Aleck    ADWPS10017
316       Mr. Alec Stevenson     ADWPS10017

Consistent data after cleaning (Invoices 1 to 4):

Bill no   CustomerName           SocialSecurityNumber
101       Mr. Aleck Stevenson    ADWPS10017
205       Mr. Aleck Stevenson    ADWPS10017
314       Mr. Aleck Stevenson    ADWPS10017
316       Mr. Aleck Stevenson    ADWPS10017
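One way to picture this clean-up is the sketch below. It assumes that rows sharing a SocialSecurityNumber belong to the same customer and, as a stand-in for a real business rule, picks the most frequent spelling (first seen wins ties) as the standard name. The function name `standardize_names` and its `canonical` parameter are hypothetical.

```python
from collections import Counter, defaultdict

# Rows mirroring the invoices above: (bill_no, customer_name, ssn).
invoices = [
    (101, "Mr. Aleck Stevenson", "ADWPS10017"),
    (205, "Mr. S Aleck", "ADWPS10017"),
    (314, "Mr. Stevenson Aleck", "ADWPS10017"),
    (316, "Mr. Alec Stevenson", "ADWPS10017"),
]


def standardize_names(rows, canonical=None):
    """Give every row that shares an identifier the same customer name.

    `canonical` optionally maps identifier -> approved name. When an
    identifier is missing from it, the most frequent spelling is used
    (an assumption of this sketch, not a rule from the slides).
    """
    canonical = dict(canonical or {})
    seen = defaultdict(Counter)
    for _, name, ssn in rows:
        seen[ssn][name] += 1
    for ssn, counts in seen.items():
        canonical.setdefault(ssn, counts.most_common(1)[0][0])
    return [(bill, canonical[ssn], ssn) for bill, _, ssn in rows]


cleaned = standardize_names(invoices)
```

In practice the approved name would more likely come from a master customer record than from a frequency count, which is why the `canonical` override is there.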


Data Profiling

Context

In the process of data warehouse design, many database professionals face situations like:
1. Several data inconsistencies in the source, like missing records or NULL values.
2. Or, the column they chose to be the primary key is not unique throughout the table.
3. Or, the schema design is not coherent with the end-user requirements.
4. Or, any other concern with the data that should have been fixed right at the beginning.

• Fixing such data quality issues would mean making changes in the ETL data flow packages, cleaning the identified inconsistencies, etc.

• This in turn will lead to a lot of rework.

• Rework means added costs to the company, in terms of both time and effort.

So, what should one do in such a case?


Solution

• Instead of a solution to the problem, it would be better to catch it right at the start before it becomes a problem.

• Use data profiling software

What is data profiling?

• It is the process of statistically examining and analyzing the content in a data source, and hence collecting information about the data.

• It consists of techniques used to analyze the data we have for accuracy and completeness.

1. Data profiling helps us make a thorough assessment of data quality.

2. It assists the discovery of anomalies in data.

3. It helps us understand the content, structure, relationships, etc. of the data in the data source we are analyzing.


4. It helps us know whether the existing data can be applied to other areas or purposes.

5. It helps us understand the various issues/challenges we may face in a database project much before the actual work begins. This enables us to make early decisions and act accordingly.

6. It is also used to assess and validate metadata.


When and how to conduct data profiling?

Generally, data profiling is conducted in two ways:

1. Writing SQL queries on sample data extracts put into a database.

2. Using data profiling tools.


When to conduct Data Profiling?

-> At the discovery/requirements gathering phase

-> Just before the dimensional modeling process

-> During ETL package design.


How to conduct Data Profiling?

• Data profiling involves statistical analysis of the data at source and the data being loaded, as well as analysis of metadata.

• These statistics can serve several analysis purposes:

Data quality: Analyze the quality of data at the data source.

NULL values: Look out for the number of NULL values in an attribute.

Candidate keys: Analyzing the extent to which certain columns are distinct gives the developer useful information for selecting candidate keys.

Primary key selection: Check that the candidate key column does not violate the basic requirements of having no NULL values and no duplicate values.

Empty string values: A string column may contain NULL or even empty string values that may create problems later.

String length: An analysis of the largest and shortest lengths, as well as the average length, of a string-type column can help us decide what data type would be most suitable for that column.


Identification of cardinality: The cardinality relationships are important for inner and outer join considerations with regard to several BI tools.

Data format: Sometimes, the format in which certain data is written in some columns may or may not be user-friendly.
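Several of the statistics listed above (NULL counts, empty strings, distinct values for candidate keys, string lengths) can be gathered with plain SQL on a sample extract. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table `customer_extract`, its columns, and the helper `profile_column` are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_extract (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customer_extract VALUES (?, ?)",
    [(1, "Aleck Stevenson"), (2, ""), (2, "S Aleck"), (None, None)],
)


def profile_column(conn, table, column):
    """Return basic profiling statistics for one column.

    Table and column names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    row = conn.execute(
        f"""
        SELECT COUNT(*),
               SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END),
               SUM(CASE WHEN {column} = '' THEN 1 ELSE 0 END),
               COUNT(DISTINCT {column}),
               MIN(LENGTH({column})),
               MAX(LENGTH({column})),
               AVG(LENGTH({column}))
        FROM {table}
        """
    ).fetchone()
    keys = ("rows", "nulls", "empty_strings", "distinct",
            "min_len", "max_len", "avg_len")
    stats = dict(zip(keys, row))
    # A primary key candidate must have no NULLs and no duplicates.
    stats["key_candidate"] = (
        stats["nulls"] == 0 and stats["distinct"] == stats["rows"]
    )
    return stats


id_stats = profile_column(conn, "customer_extract", "customer_id")
name_stats = profile_column(conn, "customer_extract", "name")
```

Here `customer_id` fails the key test because it contains both a NULL and a duplicate, which is exactly the kind of finding that is cheaper to learn before the dimensional model is fixed.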

Common Data Profiling Software

Most data integration and analysis software has data profiling built in. Alternatively, various independent data profiling tools are also available. Some popular ones are:
• Trillium Enterprise Data Quality
• Datiris Profiler
• Talend Data Profiler
• IBM InfoSphere Information Analyzer
• SSIS Data Profiling Task
• Oracle Warehouse Builder

Data Profiling

• Elimination of some input fields completely
• Flagging of missing data and generation of special surrogate keys
• Best-guess automatic replacement of corrupted values
• Human intervention at the record level
• Development of a full-blown normalized representation of the data


Staging

Accessible only to experienced data integration professionals.

It is a back-room facility, completely off limits to end users, where the data is placed after it is extracted from the source systems, cleansed, manipulated, and prepared to be loaded to the presentation layer of the data warehouse.

Any metadata generated by the ETL process that is useful to end users must come out of the back room and be offered in the presentation area of the data warehouse.


The Four Staging Steps of a Data Warehouse.

Extraction

The integration of all of the disparate systems across the enterprise is the real challenge in getting the data warehouse to a state where it is usable.

Data is extracted from heterogeneous data sources. Each data source has its distinct set of characteristics that need to be managed and integrated into the ETL system in order to effectively extract data.

Q. Write design steps of dimensional modeling.

The ETL process needs to effectively integrate systems that have different:
• DBMSs
• Operating systems
• Hardware
• Communication protocols

A logical data map is needed before the physical data can be transformed.

The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system. It is usually presented in a table or spreadsheet.


Extract

Before you begin building your extract systems, you need a logical data map that documents the relationship between original source fields and final destination fields in the tables you deliver to the front room.

This document ties the very beginning of the ETL system to the very end.

Logical data map

1. Have a plan - the foundation of the metadata.

2. Identify data source candidates - identify the likely candidate data sources you believe will support the decisions needed by the business community.

3. Analyze source systems with a data-profiling tool - any detected data anomaly must be documented, and best efforts must be made to apply appropriate business rules to rectify data before it is loaded into the data warehouse.

4. Receive a walk-through of data lineage and business rules - once the target data model is understood, the data warehouse architect and business analyst must walk the ETL architect and developers through the data lineage and business rules for extracting, transforming, and loading the subject areas of the data warehouse.

5. Receive a walk-through of the data warehouse data model.

The ETL team must completely understand the physical data model of the data warehouse. This understanding includes dimensional modeling concepts. Understanding the mappings on a table-by-table basis is not good enough.

6. Validate calculations and formulas. It is helpful to make sure the calculations are correct before you spend time coding the wrong algorithms in your ETL process.

Components of the Logical Data Map

Presented in a table or spreadsheet format and includes the following specific components:

• Target table name. The physical name of the table as it appears in the data warehouse.
• Target column name. The name of the column in the data warehouse table.
• Table type. Indicates if the table is a fact, dimension, or subdimension.
• SCD (slowly changing dimension) type. For dimensions, this component indicates a Type-1, -2, or -3 slowly changing dimension approach. This indicator can vary for each column in the dimension.

• Source database. The name of the instance of the database where the source data resides. This is usually the connect string required to connect to the database; it can also be the name of a file as it appears in the file system.
• Source table name. The name of the table where the source data originates. List all tables required to populate the relative table in the target data warehouse.
• Source column name. The column or columns necessary to populate the target. List all of the columns required to load the target column. The associations of the source columns are documented in the transformation section.
• Transformation. The exact manipulation required of the source data so it corresponds to the expected format of the target. This component is usually notated in SQL or pseudo-code.
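The components above could be held as one record per target column. The sketch below is only one possible representation: the class name `LogicalDataMapRow`, its field names, and the sample values (`customer_dim`, `crm_prod`, and so on) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalDataMapRow:
    """One row of a logical data map (field names are illustrative)."""
    target_table: str
    target_column: str
    table_type: str            # "fact", "dimension", or "subdimension"
    scd_type: Optional[int]    # 1, 2, or 3 for dimension columns, else None
    source_database: str       # connect string or file name
    source_table: str
    source_column: str
    transformation: str        # SQL or pseudo-code

    def __post_init__(self):
        if self.table_type not in {"fact", "dimension", "subdimension"}:
            raise ValueError(f"unknown table type: {self.table_type}")
        if self.scd_type not in {None, 1, 2, 3}:
            raise ValueError(f"unknown SCD type: {self.scd_type}")


row = LogicalDataMapRow(
    target_table="customer_dim",
    target_column="customer_name",
    table_type="dimension",
    scd_type=2,
    source_database="crm_prod",
    source_table="customers",
    source_column="first_name, last_name",
    transformation="TRIM(first_name) || ' ' || TRIM(last_name)",
)
```

Validating the table type and SCD type at construction time catches typing mistakes in the map before any ETL code is written against it.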


Using Tools for the Logical Data Map

The content of the logical data mapping document has been proven to be the critical element required to efficiently plan ETL processes.

The table type gives us our cue for the ordinal position of our data load processes: first dimensions, then facts.

The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process.

Target                                  Source                                  Transformation
Table Name   Column Name   Data Type    Table Name   Column Name   Data Type
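The "first dimensions, then facts" ordering can be derived mechanically from the table type column. A minimal sketch, assuming hypothetical table names and assuming that subdimensions load before the dimensions that use them (the slides do not state where subdimensions fall):

```python
# Hypothetical (target_table, table_type) pairs taken from a logical data map.
targets = [
    ("sales_fact", "fact"),
    ("customer_dim", "dimension"),
    ("date_dim", "dimension"),
    ("returns_fact", "fact"),
]

# Lower rank loads first; the subdimension rank is an assumption of this sketch.
LOAD_RANK = {"subdimension": 0, "dimension": 1, "fact": 2}


def load_order(targets):
    """Order tables so dimensions load before the facts that reference them."""
    # sorted() is stable, so tables of the same type keep their listed order.
    ranked = sorted(targets, key=lambda t: LOAD_RANK[t[1]])
    return [name for name, _ in ranked]


order = load_order(targets)
```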

The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL.

The analysis of the source system is usually broken into two major phases:
• The data discovery phase
• The anomaly detection phase

Extraction - Data Discovery Phase

A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it.

Once you understand what the target needs to look like, you need to identify and examine the data sources.

It is up to the ETL team to drill down further into the data requirements to determine each and every source system, table, and attribute required to load the data warehouse. This involves:
• Collecting and documenting source systems
• Keeping track of source systems
• Determining the system of record - the point of origin of the data

Definition of the system of record is important because in most enterprises data is stored redundantly across many different systems.

Enterprises do this to make nonintegrated systems share data. It is very common that the same piece of data is copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise, resulting in varying versions of the same data.

Data Content Analysis - Extraction

Understanding the content of the data is crucial for determining the best approach for retrieval.

• NULL values. An unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they are in foreign key columns. Joining two or more tables based on a column that contains NULL values will cause data loss! Check for NULL values in every foreign key in the source database. When NULL values are present, you must outer join the tables.
• Dates in nondate fields. Dates are very peculiar elements because they are the only logical elements that can come in various formats. Fortunately, most database systems support most of the various formats for display purposes but store them in a single standard format.
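The data loss caused by NULL foreign keys is easy to reproduce. The sketch below uses Python's built-in `sqlite3` module; the `employee` and `department` tables are hypothetical. An inner join silently drops the employee whose `dept_id` is NULL, while a left outer join keeps the row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name TEXT,
                           dept_id INTEGER);
    INSERT INTO department VALUES (10, 'Sales');
    INSERT INTO employee VALUES (1, 'Ann', 10), (2, 'Bob', NULL);
""")

# Inner join: NULL never equals anything, so Bob's row disappears.
inner = conn.execute("""
    SELECT e.emp_name, d.dept_name
    FROM employee e JOIN department d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()

# Left outer join: every employee row survives; dept_name is NULL for Bob.
outer = conn.execute("""
    SELECT e.emp_name, d.dept_name
    FROM employee e LEFT OUTER JOIN department d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()

# The profiling check suggested in the text: count NULLs in the foreign key.
null_fks = conn.execute(
    "SELECT COUNT(*) FROM employee WHERE dept_id IS NULL"
).fetchone()[0]
```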

During the initial load, capturing changes to data content in the source data is unimportant because you are most likely extracting the entire data source or a portion of it from a predetermined point in time.

Later, the ability to capture data changes in the source system instantly becomes a priority.

The ETL team is responsible for capturing data-content changes during the incremental load.

Determining Changed Data

Audit columns: used by the database and updated by triggers.

Audit columns are appended to the end of each table to store the date and time a record was added or modified.

You must analyze and test each of the columns to ensure that it is a reliable source to indicate changed data. If you find any NULL values, you must find an alternative approach for detecting change using outer joins.
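A sketch of an audit-column extract follows, again using `sqlite3`. The `orders` table, its `last_modified` column, and both helper functions are hypothetical. Timestamps are stored as ISO-8601 text so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL,
                         last_modified TEXT);
    INSERT INTO orders VALUES
        (1, 10.0, '2015-12-01 09:00:00'),
        (2, 25.0, '2015-12-20 14:30:00'),
        (3, 40.0, NULL);
""")


def audit_column_is_reliable(conn):
    """The audit column can only be trusted if it is never NULL."""
    nulls = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE last_modified IS NULL"
    ).fetchone()[0]
    return nulls == 0


def changed_since(conn, last_extract_time):
    """Return ids of rows added or modified after the previous extract."""
    rows = conn.execute(
        "SELECT order_id FROM orders WHERE last_modified > ? ORDER BY order_id",
        (last_extract_time,),
    )
    return [r[0] for r in rows]


reliable = audit_column_is_reliable(conn)
changed = changed_since(conn, "2015-12-15 00:00:00")
```

Order 3 has a NULL audit value, so the time filter never returns it. That silent omission is why the reliability check should run before this technique is adopted.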

Process of Elimination

The process of elimination preserves exactly one copy of each previous extraction in the staging area for future use.

During the next run, the process takes the entire source table(s) into the staging area and makes a comparison against the retained data from the last process.

Only differences (deltas) are sent to the data warehouse.

This is not the most efficient technique, but it is the most reliable for capturing changed data.

Initial and Incremental Loads

• Create two tables: previous load and current load.
• The initial process bulk loads into the current load table. Since change detection is irrelevant during the initial load, the data continues on to be transformed and loaded into the ultimate target fact table.
• When the process is complete, it drops the previous load table, renames the current load table to previous load, and creates an empty current load table.
• The next time the load process is run, the current load table is populated.
• Select the current load table MINUS the previous load table. Transform and load the result set into the data warehouse.
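The previous-load / current-load comparison can be tried out with `sqlite3`, as in the sketch below. SQLite spells the MINUS set operator as EXCEPT. The two-column `product_id`/`price` layout and the function names `deltas` and `rotate` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE previous_load (product_id INTEGER, price REAL);
    CREATE TABLE current_load  (product_id INTEGER, price REAL);
    INSERT INTO previous_load VALUES (1, 9.99), (2, 5.00);
    INSERT INTO current_load  VALUES (1, 9.99), (2, 5.50), (3, 12.00);
""")


def deltas(conn):
    """Rows in current_load that are new or changed since previous_load."""
    return conn.execute("""
        SELECT product_id, price FROM current_load
        EXCEPT
        SELECT product_id, price FROM previous_load
        ORDER BY product_id
    """).fetchall()


def rotate(conn):
    """Drop previous_load, rename current_load, create an empty current_load."""
    conn.executescript("""
        DROP TABLE previous_load;
        ALTER TABLE current_load RENAME TO previous_load;
        CREATE TABLE current_load (product_id INTEGER, price REAL);
    """)


changed = deltas(conn)   # only these rows go on to transformation and loading
rotate(conn)             # prepare the tables for the next run
```

Note that this direction of the comparison finds new and changed rows only; rows deleted from the source would need the reverse comparison.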


Transformation

• Main step where the ETL adds value
• Actually changes data and provides guidance on whether data can be used for its intended purposes
• Performed in the staging area

Data quality paradigm:
• Correct
• Unambiguous
• Consistent
• Complete

Data quality checks are run at two places: after extraction, and after cleaning and conforming, where additional checks are run.

Transformation - Cleaning Data

Anomaly detection
• Data sampling, e.g., a count of the rows for each value of a department column

Column property enforcement
• NULL values in required columns
• Numeric values that fall outside of expected highs and lows
• Columns whose lengths are exceptionally short or long
• Columns with certain values outside of discrete valid value sets
• Adherence to a required pattern or membership in a set of patterns
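The column property checks listed above can be written as simple per-record rules. In the sketch below, every column name, limit, valid value, and pattern is a made-up example rather than something taken from the slides.

```python
import re

# Hypothetical rules for an employee extract; all names and limits are illustrative.
VALID_DEPARTMENTS = {"SALES", "HR", "IT"}
PHONE_PATTERN = re.compile(r"\d{3}-\d{4}")


def column_errors(row):
    """Return a list of column property violations for one record (a dict)."""
    errors = []
    if row.get("emp_id") is None:
        errors.append("emp_id: NULL in required column")
    salary = row.get("salary")
    if salary is not None and not (0 <= salary <= 1_000_000):
        errors.append("salary: outside expected range")
    name = row.get("name") or ""
    if not (2 <= len(name) <= 50):
        errors.append("name: length exceptionally short or long")
    if row.get("department") not in VALID_DEPARTMENTS:
        errors.append("department: not in valid value set")
    if not PHONE_PATTERN.fullmatch(row.get("phone") or ""):
        errors.append("phone: does not match required pattern")
    return errors


good = {"emp_id": 7, "salary": 52_000, "name": "Ann Lee",
        "department": "IT", "phone": "555-0142"}
bad = {"emp_id": None, "salary": -5, "name": "A",
       "department": "OPS", "phone": "5550142"}
```

Returning a list of violations, rather than failing on the first one, lets the ETL job record every problem found in a row in a single pass.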

Transformation - Conforming

Structure enforcement
• Tables have proper primary and foreign keys
• Tables obey referential integrity

Data and rule value enforcement
• Simple business rules
• Logical data checks

Flowchart: Staged Data -> Cleaning and Conforming -> Fatal errors? -> Yes: Stop / No: Loading
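The stop-or-load decision in this flow might look like the sketch below. What counts as a fatal error is a business decision; here a missing `customer_id` is treated as fatal and stray whitespace as fixable, purely as an assumption. All names (`FatalDataError`, `clean_and_conform`, `run`) are hypothetical.

```python
class FatalDataError(Exception):
    """Raised when a quality check fails badly enough to stop the load."""


def clean_and_conform(staged_rows):
    """Fix minor problems, but stop on fatal ones."""
    cleaned = []
    for row in staged_rows:
        if row.get("customer_id") is None:
            raise FatalDataError(f"missing customer_id in {row!r}")
        cleaned.append({**row, "name": row["name"].strip()})
    return cleaned


def run(staged_rows, load):
    """Call `load` only if cleaning and conforming raised no fatal error."""
    try:
        cleaned = clean_and_conform(staged_rows)
    except FatalDataError:
        return "stopped"
    load(cleaned)
    return "loaded"


warehouse = []
ok = run([{"customer_id": 1, "name": " Aleck "}], warehouse.extend)
stopped = run([{"customer_id": None, "name": "x"}], warehouse.extend)
```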

Loading

• Loading dimensions
• Loading facts

Loading Dimensions

• The primary key is a single field containing a meaningless unique integer, known as a surrogate key.
• The DW owns these keys and never allows any other entity to assign them.
• Dimensions are de-normalized flat tables: all attributes in a dimension must take on a single value in the presence of a dimension primary key.
• A dimension should possess one or more other fields that compose the natural key of the dimension.


The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes.

Creating and assigning the surrogate keys occur in this module.

The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.

Loading dimensions

When the DW receives notification that an existing row in a dimension has changed, it can give one of three types of response:
• Type 1
• Type 2
• Type 3
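As commonly defined, Type 1 overwrites the changed attribute, Type 2 adds a new dimension row with a new surrogate key, and Type 3 keeps the prior value in an extra column. A minimal sketch of the three responses on a dictionary-based dimension row follows. The column names (`row_start`, `row_end`, `is_current`, `previous_city`) and function names are illustrative assumptions.

```python
from datetime import date


def apply_type1(row, attr, new_value):
    """Type 1: overwrite the attribute; history is lost."""
    return {**row, attr: new_value}


def apply_type2(rows, current, attr, new_value, next_key, today):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    expired = {**current, "row_end": today, "is_current": False}
    fresh = {**current, attr: new_value, "surrogate_key": next_key,
             "row_start": today, "row_end": None, "is_current": True}
    return [expired if r is current else r for r in rows] + [fresh]


def apply_type3(row, attr, new_value):
    """Type 3: keep the prior value in a 'previous_<attr>' column."""
    return {**row, f"previous_{attr}": row[attr], attr: new_value}


customer = {"surrogate_key": 1, "customer_id": "C100", "city": "Pune",
            "row_start": date(2015, 1, 1), "row_end": None, "is_current": True}

t1 = apply_type1(customer, "city", "Mumbai")
t2 = apply_type2([customer], customer, "city", "Mumbai",
                 next_key=2, today=date(2015, 12, 26))
t3 = apply_type3(customer, "city", "Mumbai")
```

Only the Type 2 response creates a new surrogate key, which is why it also affects the key lookup used when loading facts.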

Type 1 Dimension

Type 2 Dimension

Type 3 Dimension

Loading facts

Facts
• Fact tables hold the measurements of an enterprise.
• The relationship between fact tables and measurements is extremely simple.
• If a measurement exists, it can be modeled as a fact table row. If a fact table row exists, it is a measurement.

Key Building Process - Facts

When building a fact table, the final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys.

ETL maintains a special surrogate key lookup table for each dimension. This table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity.

All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys. This is one of the reasons for making the lookup tables separate from the original data warehouse dimension tables.
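An in-memory lookup of this kind can be as simple as a dictionary per dimension. The sketch below uses hypothetical keys and column names. How to treat a natural key that has no surrogate key yet is a design decision; this sketch simply rejects the record.

```python
# Hypothetical in-memory lookup tables: natural key -> current surrogate key.
customer_lookup = {"C100": 2, "C200": 3}
product_lookup = {"SKU-1": 10, "SKU-2": 11}


def surrogate_key(lookup, natural_key):
    """Return the current surrogate key for a natural key, or raise."""
    if natural_key not in lookup:
        raise ValueError(f"no surrogate key for natural key {natural_key!r}")
    return lookup[natural_key]


def build_fact_row(record):
    """Swap the natural keys in an incoming fact record for surrogate keys."""
    return {
        "customer_key": surrogate_key(customer_lookup, record["customer_id"]),
        "product_key": surrogate_key(product_lookup, record["sku"]),
        "amount": record["amount"],
    }


def on_type2_change(lookup, natural_key, new_surrogate_key):
    """Point the natural key at the newest dimension row after a Type 2 change."""
    lookup[natural_key] = new_surrogate_key


fact = build_fact_row({"customer_id": "C100", "sku": "SKU-2", "amount": 19.5})
```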

Key Building Process

Loading Fact Tables

Managing indexes
• Indexes are performance killers at load time
• Drop all indexes before the load
• Segregate updates from inserts
• Load updates
• Rebuild indexes
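The index-handling steps can be sketched with `sqlite3` as below, following the order given in the list. The `sales_fact` table, the index name `idx_sales_date`, and the function `load_facts` are hypothetical, and a real warehouse would normally use its database's bulk loader rather than `executemany`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, date_key INTEGER,
                             amount REAL);
    CREATE INDEX idx_sales_date ON sales_fact (date_key);
    INSERT INTO sales_fact VALUES (1, 20151225, 10.0);
""")

# Incoming records as (sale_id, date_key, amount).
incoming = [(1, 20151225, 12.0), (2, 20151226, 30.0), (3, 20151226, 8.0)]


def load_facts(conn, incoming):
    """Drop the index, apply updates and inserts separately, rebuild the index."""
    existing = {r[0] for r in conn.execute("SELECT sale_id FROM sales_fact")}
    # Segregate updates from inserts so each can be applied in bulk.
    updates = [r for r in incoming if r[0] in existing]
    inserts = [r for r in incoming if r[0] not in existing]

    conn.execute("DROP INDEX idx_sales_date")  # drop before the load
    conn.executemany(
        "UPDATE sales_fact SET date_key = ?, amount = ? WHERE sale_id = ?",
        [(d, a, s) for s, d, a in updates],
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", inserts)
    conn.execute("CREATE INDEX idx_sales_date ON sales_fact (date_key)")  # rebuild
    conn.commit()
    return len(updates), len(inserts)


counts = load_facts(conn, incoming)
```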

Managing partitions
• Partitions allow a table (and its indexes) to be physically divided into minitables for administrative purposes and to improve query performance.
• The most common partitioning strategy on fact tables is to partition the table by the date key, because the date dimension is preloaded and static.
• The fact table needs to be partitioned on the key that joins to the date dimension for the optimizer to recognize the constraint.
• The ETL team must be advised of any table partitions that need to be maintained.

Maintaining the Rollback Log

The rollback log, also known as the redo log, is invaluable in transaction (OLTP) systems. But in a data warehouse environment where all transactions are managed by the ETL process, the rollback log is a superfluous feature that must be dealt with to achieve optimal load performance. Reasons why the data warehouse does not need rollback logging are:
• All data is entered by a managed process, the ETL system.
• Data is loaded in bulk.
• Data can easily be reloaded if a load process fails.

Each database management system has different logging features and manages its rollback log differently.


References

“The Data Warehouse ETL Toolkit” by Ralph Kimball

www.slideshare.com