
A COURSEWARE ON ETL PROCESS

Nithin Vijayendra
B.E, Visveswaraiah Technological University, Karnataka, India, 2005

PROJECT

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

at

CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2010


A COURSEWARE ON ETL PROCESS

A Project

by

Nithin Vijayendra

Approved by:

__________________________________, Committee Chair
Dr. Meiliu Lu

__________________________________, Second Reader
Dr. Chung-E Wang

____________________________
Date


Student: Nithin Vijayendra

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the Project.

__________________________, Graduate Coordinator
Dr. Nik Faroughi

________________
Date

Department of Computer Science


Abstract

of

A COURSEWARE ON ETL PROCESS

by

Nithin Vijayendra

Extract, Transform and Load (ETL) is a fundamental process used to populate a data warehouse. It involves extracting data from various sources, transforming the data according to business requirements and loading it into target data structures. Within the transform phase, a well-designed ETL system should also enforce data quality and data consistency and conform data so that data from various source systems can be integrated. Once the data is loaded into the target systems in a presentation-ready format, the end users can run queries against it to generate reports which help them make better business decisions. Even though the ETL process consumes roughly 70% of computing resources, it is hardly visible to the end users [5].

The objective of this project is to create a website which contains courseware on the ETL process and a web-based ETL tool. The website, containing the ETL courseware, can be accessed by anyone with internet access. This will be helpful to a wide range of audiences, from beginners to experienced users. The website is created using HTML, PHP, Korn shell scripts and MySQL.


The ETL tool is web based, and anyone with internet access can use it for free. However, guests have limited access to this tool, while registered users have complete access. Using this tool, data can be extracted from text files and MySQL tables, combined together and loaded into MySQL tables. Before the data is loaded into the target MySQL tables, various transformations can be applied to it according to business requirements. This tool is developed using HTML, PHP, SQL and Korn shell scripts.

_______________________, Committee Chair
Dr. Meiliu Lu

_______________________
Date


ACKNOWLEDGMENTS

I am thankful to all the people who have helped and guided me through this journey of completing my master's project.

My sincere thanks to Dr. Meiliu Lu for giving me an opportunity to work under her guidance on my master's project. She has been very supportive and encouraging and has guided me throughout the project. My heartfelt thanks to Prof. Chung-E Wang for being my second reader.

My special thanks to my friend Sreenivasan Natarajan for his patience in reviewing my project report. I would also like to thank all my friends who have been there for me throughout my graduate program at California State University, Sacramento.

Last but not least, I would like to thank my parents, sister and relatives for their unconditional love, support and motivation.


TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures

Chapter
1. INTRODUCTION
2. BACKGROUND
   2.1 Need for an ETL tool
   2.2 Scope of the project
   2.3 Technology related
3. ETL COURSEWARE
   3.1 ETL components
   3.2 Requirements
      3.2.1 Business requirements
      3.2.2 Data profiling
      3.2.3 Data integration requirements
      3.2.4 Data latency requirements
      3.2.5 Data archiving requirements
   3.3 Data profiling
   3.4 Data extraction
   3.5 Data validation and integration
   3.6 Data cleansing
   3.7 Data transformations
      3.7.1 Surrogate key generator operation
      3.7.2 Lookup operation
      3.7.3 Merge operation
      3.7.4 Aggregation operation
      3.7.5 Change capture operation
      3.7.6 Change apply operation
      3.7.7 Data type operation
   3.8 Data load
      3.8.1 Historic load
      3.8.2 Incremental load
      3.8.3 Loading dimension tables
         3.8.3.1 Type 1 Slowly Changing Dimension
         3.8.3.2 Type 2 Slowly Changing Dimension
      3.8.4 Loading fact tables
   3.9 Exception handling
4. ETL TOOL ARCHITECTURE
5. ETL TOOL IMPLEMENTATION
   5.1 Using the tool
   5.2 Extraction phase
      5.2.1 Text file as source
      5.2.2 MySQL table as source
   5.3 Transformation phase
      5.3.1 Transformation for a single source
      5.3.2 Transformation for multiple sources
   5.4 Loading phase
6. CONCLUSION
   6.1 Future enhancements
Bibliography


LIST OF TABLES

Table 1 Before snapshot of Store Dimension Table for Type 1 SCD
Table 2 After snapshot of Store Dimension Table for Type 1 SCD
Table 3 Snapshot 1 of Store Dimension Table for Type 2 SCD (Method 1)
Table 4 Snapshot 2 of Store Dimension Table for Type 2 SCD (Method 1)
Table 5 Snapshot 3 of Store Dimension Table for Type 2 SCD (Method 1)
Table 6 Snapshot 1 of Store Dimension Table for Type 2 SCD (Method 2)
Table 7 Snapshot 2 of Store Dimension Table for Type 2 SCD (Method 2)
Table 8 Snapshot 3 of Store Dimension Table for Type 2 SCD (Method 2)
Table 9 Table Structure to Store Usernames and Password
Table 10 Type of Input Box Based on Data Type
Table 11 Structure of INFORMATION_SCHEMA.COLUMNS Table


LIST OF FIGURES

Figure 1 Overview of ETL Process
Figure 2 Screenshot of Transformations Page
Figure 3 Screenshot of Transformations and Load Page of ETL Tool
Figure 4 ETL Process
Figure 5 Components of ETL
Figure 6 OLTP Source for ETL Process
Figure 7 Delimited File Format
Figure 8 Fixed-width File Format
Figure 9 Overview of Data Transformation
Figure 10 Lookup Operation
Figure 11 Merge Operation
Figure 12 ETL Tool Layers
Figure 13 Layers and Components of ETL Tool
Figure 14 Add or Delete Users
Figure 15 Source Selection
Figure 16 Screenshot of Define Metadata Page
Figure 17 Flow for Landing Data Using Text Source
Figure 18 Flow of Landing Data Using MySQL Table as Source
Figure 19 Screenshot of Database Details Webpage
Figure 20 Flow of Transformation Phase
Figure 21 Screenshot Showing Various Transformations
Figure 22 Flow of Landing Phase
Figure 23 Screenshot Showing Transformations for Multiple Sources


Chapter 1

INTRODUCTION

According to Bill Inmon [1], "A data warehouse is a historical, subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". By Historical we mean that data is continuously collected from various sources and loaded into the warehouse; previously loaded data is not deleted for a long period of time, so the warehouse contains historical data. By Subject Oriented we mean that data is grouped into specific business areas instead of the business as a whole. By Integrated we mean collecting and merging data from various sources, and these sources could be disparate in nature. By Time-variant we mean that all data in the data warehouse is identified with a particular time period. By Non-volatile we mean that once data is loaded into the warehouse it is never deleted or overwritten; hence it is not expected to change over time.

Extract, Transform, Load (ETL) is the back-end process which involves collecting data from various sources, preparing the data according to business requirements and loading it into the data warehouse. Extraction is the process where data is extracted from various source systems and temporarily stored in database tables or files. Source systems could range from one to many in number, and they may be similar or completely disparate in nature. Once the extracted data is staged temporarily, it should be checked for validity and consistency using data validation rules. Transformation is the process which involves the application of business rules to the source data before it is loaded into the data warehouse.

Figure 1 Overview of ETL process

As can be seen in Figure 1, there can be several data sources whose characteristics differ from each other. These data sources could be in different geographic locations; could be incompatible with the organization's data store; could be many in number; could be on different platforms like mainframes, UNIX or Windows; and the availability of data from each source system may also vary. In the Extraction phase, data needs to be extracted from the various source systems and placed temporarily in databases or flat files called the landing zone [6].


In the Transformation phase, the landed data is picked up, cleansed and transformed based on the business requirements. There can be one or many transformation operations applied to the datasets, which could lead to a change in data value, a change of data type or a change of data structure by addition or deletion of data. The transformed data is loaded into database tables, and this area is called the staging area [6].

In the Loading phase, which is the last step in the ETL process, the validated, integrated, cleansed, transformed and ready-to-load data from the staging area is loaded into the warehouse dimension and fact tables.

This report is organized into several chapters. Chapter 1 gives a brief introduction to the data warehouse and the role of the ETL process in data warehousing projects. Chapter 2 discusses the background and gives a detailed introduction to the ETL process; it covers the need for an ETL tool, the scope of the project and the related technology used to build the ETL web tool. Chapter 3 discusses the ETL courseware. The ETL courseware contains material which a user new to ETL processes must know in order to implement successful ETL projects. There are several components of ETL, like requirements, data profiling, extraction, validation, integration and others, which this chapter explains in detail. Chapter 4 gives an overview of the architecture of the ETL web tool created for this project. Chapter 5 discusses the implementation of the ETL web tool in detail along with snippets of important source code. Chapter 6 summarizes and concludes this report with a glimpse into future enhancements to the courseware and the tool.


Chapter 2

BACKGROUND

The existence of the data warehouse dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". It was meant to provide an architectural model focused on the flow of data from operational systems to decision support systems. The architecture consisted of an operational data layer, a data access layer, a metadata layer and an informational access layer. The operational data layer is the source for the data warehouse; the data access layer is the interface between the operational data layer and the informational access layer; the metadata layer is the data dictionary; and the informational access layer is the last layer, used by business analysts to analyze data and generate reports [8].

There are several approaches to populating a warehouse. The top-down approach, by Bill Inmon [1], proposes populating the data warehouse first and then populating the data marts. The bottom-up approach, by Ralph Kimball [3], proposes populating the data marts first and then populating the data warehouse. There is also the hybrid approach, which is a combination of the top-down and bottom-up approaches [5]. An ETL tool is used in the data access layer to extract data from source systems and load it into the warehouse or the mart, irrespective of which approach is used.


2.1 Need for an ETL tool

Interested users who would like to learn more about ETL tools may not have access to them. This is because commercial ETL tools are expensive to buy for small or medium sized projects. They need expensive hardware to run on and need specialists to configure them before a normal user can start using them. Most of them are not open source or web based. They also have short evaluation periods, like 30 to 60 days. The ETL web tool created in this project helps overcome the above challenges. It is web based, freely accessible to anyone with internet access and very user-friendly. Beginning ETL developers can use this tool to get a feel for what an ETL tool does before they dive into understanding complex commercial ETL tools.

An ETL tool has many advantages over hand-coded ETL code. It enables simpler, faster and cheaper development of ETL code. Technical people with broad business skills who are not professional programmers can use ETL tools effectively. Many ETL tools generate metadata automatically at every step of the process, thus enforcing consistent metadata throughout the process. They also have integrated metadata repositories which can be synchronized with source systems, target systems and other Business Intelligence tools. They deliver good performance with very large datasets by using parallel processing concepts such as pipelining [4]. They come with built-in connectors for most source and target systems. Most ETL tools these days also have built-in schedulers to run the ETL code at scheduled times.


2.2 Scope of the project

This project mainly focuses on the creation of a website which contains courseware on ETL and a web-based ETL tool.

The ETL courseware is available to everyone with an internet connection and is intended for a wide range of audiences, from beginners to experienced professionals. It first explains the basics of each phase - Extract, Transform and Load - and later delves into the details of each phase. Below is a screenshot of a webpage from the ETL courseware website.

Figure 2 Screenshot of ETL courseware page

The ETL tool is web based and accessible to anyone with internet access. It allows users to select data from two types of source systems: Comma Separated Value (CSV) flat files, MySQL tables or a combination of both. Once the data is extracted, the datasets can be merged together, data-type-based transformations can be applied to each column and the result can be loaded into target MySQL tables. Below is a screenshot of a webpage from the ETL web tool.

Figure 3 Screenshot of transformations and load page of ETL tool

2.3 Technology related

This section discusses the various technologies used in developing the ETL web tool. The ETL tool is composed of three layers - the client layer, the processing layer and the database layer - which are described below along with the technologies used.

Client Layer: Users use a web browser, like Microsoft Internet Explorer, to access and control the ETL tool. They can specify the number of sources, the type of each source, the transformations to be applied to the extracted data and the target MySQL table connection details. This layer is built using PHP and HTML.


Processing Layer: The processing layer collects the user input from the client layer. It processes user requests in the background and displays success or error messages to the user. This layer has MySQL and text file connectors which connect to the sources and extract data based on the user's input. Once the data is extracted, it is staged in temporary MySQL tables or flat files so that transformations can be applied to it without disturbing the source content. This layer is built using PHP, Korn shell scripts, and MySQL statements and scripts.

Database Layer: The database layer has the target MySQL connector which connects to the target MySQL table and inserts the transformed data. This layer is built using Korn shell scripts, and MySQL statements and scripts.

MySQL 5.1

MySQL is an open source SQL database management system, developed, distributed, and supported by Oracle Corporation. It is used for mission-critical, heavy-load production systems and delivers a very fast, multi-user, multi-threaded and robust SQL database server [10].

Key Features

1. High performance for a variety of workloads.
2. Connectors for C, ODBC, Java, PHP, Perl, .NET, etc.
3. Wide range of supported platforms.
4. XML functions with XPath support.
5. Partitioning.
6. Row-based replication.
7. Great documentation, community and commercial support [11].

PHP 4.3

PHP: Hypertext Preprocessor is a general-purpose scripting language that is designed for web development to produce dynamic web pages. PHP code is embedded into the HTML source document and is interpreted by a web server installed with a PHP processor module, which generates the web page document.

Key Features

1. Persistent database connections

2. Good connection handling

3. Easy remote file handling

4. Better session handling

5. Good command line usage [12]

Korn Shell Scripts:

The Korn shell (ksh) is a Unix shell which was developed by David Korn. It is backwards-compatible with the Bourne shell and includes many features of the C shell [13].

Key Features

1. Supports associative arrays and built-in floating point arithmetic
2. Supports pipes
3. Supports pattern matching
4. Exception handling
5. Multidimensional arrays
6. Sub-shells
7. Unicode support [14]

In addition to the above features, PHP and MySQL are used extensively on the California State University, Sacramento campus.

In this chapter we discussed the background of ETL tools. We mentioned the need for an ETL tool and its several advantages over hand-coded ETL. The scope of this project was discussed along with the related technologies that were used to create a web-based ETL tool. Now that the reader is familiar with the overview and motivation for this project, we discuss the ETL courseware in Chapter 3.


Chapter 3

ETL COURSEWARE

The ETL courseware can be read by anyone who is interested in learning about ETL processes. It is very useful to users who have no prior knowledge of ETL processes or ETL tools. Professional ETL developers could also use it as a reference. The ETL courseware is freely available to everyone who has internet access and can be accessed at http://gaia.ecs.csus.edu/~web_etl/etl/.

The courseware starts with the basics and then proceeds to advanced topics. It first introduces the ETL process to readers and then discusses the various ETL components. Each ETL component is explained in detail with an example and a figure for easier understanding. Important topics like requirements, data profiling, data extraction, data transformation and data loading are explained in depth.

After reading this courseware, readers should be able to define what the ETL process is and what its various components are. They will have a thorough understanding of what each component does, and they will be able to apply these concepts in ETL project implementations.

Users can better understand the ETL courseware by using the ETL tool implemented for this project. The ETL tool is simple to use and user-friendly, and users can learn the basics of an ETL tool before trying their hands at complex, commercially available ETL tools.


Based on my prior work experience and by referring to the following books, I have compiled this courseware:

W.H. Inmon, "Building the Data Warehouse", Fourth Edition
Jack E. Olson, "Data Quality: The Accuracy Dimension"
Ralph Kimball, Margy Ross, "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling", Second Edition
Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger, "Mastering Data Warehouse Design: Relational and Dimensional Techniques"
Ralph Kimball, Joe Caserta, "The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data"
Larissa T. Moss, Shaku Atre, "Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications"
Ralph Kimball, Margy Ross, "The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence"

The following sections give a brief overview of the ETL process and its components. For more details, please refer to the website at http://gaia.ecs.csus.edu/~web_etl/etl/.


Extract, Transform and Load (ETL) is a fundamental process used to populate a data warehouse. It involves extracting data from various sources, validating it for accuracy, cleaning it and making it consistent, transforming the data according to business requirements and loading it into the target data warehouse. Within the Transform phase, a well-designed ETL system should also enforce data quality and data consistency and conform data so that data from various source systems can be integrated. Once the data is loaded into the target systems in a presentation-ready format, the end users can run queries against it to make better business decisions. Even though the ETL process consumes roughly 70% of the resources, it is hardly visible to the end users [5].

Figure 4 ETL process


3.1 ETL components

Below are the various components of an ETL process.

Figure 5 Components of ETL

This figure shows just one ETL flow, which loads one or more tables. Similarly, in the background there can be multiple ETL flows loading tables in the same data warehouse.

3.2 Requirements

Just as designing any system requires understanding the requirements first, the design of an ETL system should also start with requirements analysis. All the known requirements and constraints affecting the ETL system have to be gathered in one place.

Based on the requirements, architectural decisions should be made at the beginning of the ETL project. Construction of ETL code should start only after the architectural decisions are baselined. A change in architecture at a later point of implementation would result in re-implementing the entire system from the very beginning, since architectural decisions affect hardware, software, personnel and coding practices. Listed below are the major requirements.


3.2.1 Business requirements

Business requirements are the requirements of the end users who use the data warehouse. Based on the populated information content, the end users can make better informed decisions. Selection of source systems is directly dependent on the business needs. Interviewing the end users to gather business requirements not only sets an expectation of what they can do with the data, but also creates the possibility of the ETL team discovering additional capabilities in the data sources that can expand the end users' decision-making capabilities.

3.2.2 Data profiling

Data profiling is the process of examining the data available in the source systems and collecting statistics about that data. The purpose of these statistics can be to:

Assess the risk involved in integrating data from various applications, including the challenges of joins.
Assess whether metadata accurately describes the actual values present in the source systems.
Understand data challenges in the early stages of the project to avoid delays and cost.

Data profiling examines the quality, scope and context of the source data and enables the ETL team to build an effective ETL system. If the source data is very clean and well maintained before it arrives at the data warehouse, then minimal transformation is required to load it into the dimension and fact tables. If the source data is dirty, then most of the ETL team's effort will go into transforming, cleaning and conforming the data. Sometimes the source data might be so deeply flawed that it cannot support the business objectives; in this case the data warehouse project should be cancelled.

Data profiling gives the ETL team a clear picture of how many data cleaning processes should be in place to meet the end users' requirements. This also results in better estimates and timely completion of the project.

3.2.3 Data integration requirements

Data from transaction systems must be integrated before it arrives in the data warehouse. In data warehousing this takes the form of conforming dimensions and conforming facts.

Conformed dimensions contain common attributes from different databases so that drill-across reports can be generated using these attributes. Conformed facts are common measures, like Key Performance Indicators (KPIs), from across different databases, so that these numbers can be compared mathematically.

In an ETL system, data integration is a separate step and involves mandating common names for attributes and facts and common units of measurement.

3.2.4 Data latency requirements

The data latency requirement from the end users specifies how quickly the data has to be delivered to them. This requirement has a huge effect on the architecture of the ETL system. A batch-oriented ETL system can be sped up using efficient processing algorithms, parallel processing and faster hardware, but sometimes the end users require data on a real-time basis. This requires converting the ETL system from batch-oriented to real-time.

3.2.5 Data archiving requirements

Archiving data after it is loaded into the data warehouse is a safer approach for when data needs to be reprocessed or audited.

3.3 Data profiling

Data profiling is the process of examining the data available in the source systems and collecting statistics about that data. The purpose of these statistics can be to:

Assess the risk involved in integrating data from various applications, including the challenges of joins.
Assess whether metadata accurately describes the actual values present in the source systems.
Understand data challenges in the early stages of the project to avoid delays and cost.

According to Jack Olson [2], data profiling employs analytic methods of looking at data for the purpose of developing a thorough understanding of the content, structure and quality of the data. A good data profiling system can process very large amounts of data and, with the skills of the analyst, uncover all sorts of issues that need to be addressed.


Data profiling examines the quality, scope and context of the source data and enables the ETL team to build an effective ETL system. If the source data is very clean and well maintained before it arrives at the data warehouse, then minimal transformation is required to load it into the dimension and fact tables. If the source data is dirty, then most of the ETL team's effort will go into transforming, cleaning and conforming the data. Sometimes the source data might be so deeply flawed that it cannot support the business objectives; in this case the data warehouse project should be cancelled. Data profiling can be achieved using commercial tools or hand-coded applications. The data profiling process reads the source data and generates a comprehensive report on:

Data types of each field
Natural keys
Relationships between tables
Data statistics like maximum values, minimum values, most frequent values, number of occurrences of each value, etc.
Dates in non-date fields
Data anomalies like junk values, values outside a given range, missing values, etc.
Null values

Data profiling gives the ETL team a clear picture of how many data cleaning processes should be in place to meet the end users' requirements. This also results in better estimates and timely completion of the project.
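As a minimal illustration of the kind of statistics a profiling report contains, the sketch below gathers them with plain SQL; the table src_store and column store_city are hypothetical names used only for this example.

    -- Row count, null count, distinct count and value range for one column.
    SELECT COUNT(*)                      AS total_rows,
           COUNT(store_city)             AS non_null_values,
           COUNT(*) - COUNT(store_city)  AS null_values,
           COUNT(DISTINCT store_city)    AS distinct_values,
           MIN(store_city)               AS minimum_value,
           MAX(store_city)               AS maximum_value
    FROM   src_store;

    -- Most frequently occurring values and their counts.
    SELECT store_city, COUNT(*) AS occurrences
    FROM   src_store
    GROUP BY store_city
    ORDER BY occurrences DESC
    LIMIT 10;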


3.4 Data extraction

This section focuses on the extraction of data from the source systems. Data extraction is the process of selecting, transporting and consolidating the data from the source systems to the ETL environment.

As organizations grow, they add more data sources to their existing data store. Each source system has characteristics which differ from the others. These data sources could be in different geographic locations; could be incompatible with the organization's data store; could be many in number; could be on different platforms like mainframes, UNIX or Windows; the periodicity (daily, weekly, monthly, etc.) of feeding the data to the warehouse could vary; and the availability of data from each source system may also vary. These vast differences in source system characteristics make data integration challenging.

Data extracted from source systems is placed temporarily in databases or flat files, and this area is called the landing zone. Described below are a few sources from which organizations commonly extract data.

OLTP systems:

OLTP stands for Online Transaction Processing Systems. These are a class of systems which facilitate and manage transaction-oriented applications that require fast data insertions and retrievals. These systems store their daily transactional data in relational databases. These databases are normalized for faster insertion and retrieval queries.


To extract data from these systems, ODBC drivers or native database drivers are used. The disadvantage of ODBC drivers over native database drivers is that they require more steps, as shown in the diagram below, and take more time. It is a better approach to always extract only the required data by using an appropriate WHERE clause in the SELECT query; a small example follows Figure 6.

Figure 6 OLTP source for ETL process
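As a small, hedged example of such a selective extract (the table sales_order and column last_update_ts are hypothetical names, not part of any particular source system):

    -- Pull only the rows changed during the current load window instead
    -- of extracting the whole table.
    SELECT order_id,
           customer_id,
           order_amount,
           last_update_ts
    FROM   sales_order
    WHERE  last_update_ts >= '2010-11-01 00:00:00'   -- start of load window
      AND  last_update_ts <  '2010-11-02 00:00:00';  -- end of load window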

Flat files

A flat file is an operating system file which contains text or binary content, one record per line. In ETL projects we usually come across two formats of flat files. Both are described below.


Delimited file format

In this file format, data is stored in separate lines as shown below. Each line is separated by a newline character and each field is separated by a delimiter. The field delimiter can be a comma, a pipe or another character, but it has to remain the same for all fields. Each record could also have a record delimiter; in this example the record delimiter is a semicolon. There could also be a quote delimiter for each field, which is a single quote or a double quote. The final delimiter, in this case an End-Of-File character, denotes the end of the flat file.

Figure 7 Delimited file format

Fixed-width file format

In this file format, data is stored in separate lines similar to the delimited file format; however, each field occupies a fixed width and all lines are of the same width. Field values which do not occupy the entire field length are padded with spaces. Each field can be identified by its start and end position. In the example below, field 1 starts at position 1 and ends at position 4, field 2 starts at position 5 and ends at position 11, and so on.

Figure 8 Fixed-width file format
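A comma-delimited file like the one in Figure 7 can be landed into a MySQL staging table with a LOAD DATA statement. The sketch below assumes a hypothetical file path and landing table land_store:

    LOAD DATA LOCAL INFILE '/tmp/store_extract.csv'
    INTO TABLE land_store
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'   -- quote delimiter, if the file uses one
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES               -- skip the header record, if present
    (store_id, store_city, store_state, store_country);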


Web log sources

Most internet companies have a log, called a web log, which stores visitor information. These web logs record the information posted or retrieved on that particular website by each user. There are several uses for this. One is to analyze users' click patterns on the website and find out which webpage gets the most hits from which geographic location. Based on this information, the companies can further analyze and improve their website to increase user traffic. In order to analyze web logs, they have to be extracted from various regions, transformed and stored in data warehouses.

ERP systems:

ERP stands for Enterprise Resource Planning. ERP systems were designed to provide integrated enterprise solutions by integrating important business functions like inventory, human resources, sales, financials, etc. Since ERP systems are massive and extremely complex, it can take years to collect data in them according to business requirements. This data is a valuable source for ETL systems. Nowadays most ETL tools provide ERP connectors to fetch data from ERP systems.

FTP

FTP stands for File Transfer Protocol and is a standard protocol used over TCP/IP networks to transfer files between machines. When ETL tools don't have appropriate connectors or adapters to connect to the source system, an FTP pull or push is used to fetch data into the ETL environment.

FTP pull takes place at the ETL end and is used when we are sure the source file will be available at a pre-determined time. Here the transfer is initiated by the ETL tool at a scheduled time.

FTP push takes place at the source end and is used when the availability of the source file is unknown. Here the transfer is initiated by the source system when the file is created.

3.5 Data validation and integration

This section focuses on the integration and validation of extracted data. Data extracted from the source systems must be validated and integrated before proceeding to the next phase of the ETL process.

Data validation: Data that has been extracted from different source systems, and landed in the landing zone, must be validated before being integrated. This phase makes sure the source data has been completely transferred to the ETL environment. It also makes sure the latest data in the source systems has been extracted. It is important that the extraction process extracts only the latest business data; otherwise, reloading old data would lead to duplicates in the data warehouse.

There are several ways to check whether the complete source data has been extracted:

If flat files are extracted, make sure the record counts in the source and target match.
For flat files, make sure the source and target checksums match.
For data extracted into tables, make sure the record counts in the source and target tables match.
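For the table-to-table case, a record-count comparison can be written directly in SQL. This is only a sketch; sales_order and land_sales_order are hypothetical source and landing tables:

    -- A non-empty result signals an incomplete extract.
    SELECT s.src_count, l.land_count
    FROM  (SELECT COUNT(*) AS src_count  FROM sales_order)      AS s,
          (SELECT COUNT(*) AS land_count FROM land_sales_order) AS l
    WHERE  s.src_count <> l.land_count;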


Data integration: Data that has been extracted from different source systems, and landed in the landing zone, must be integrated after it has been validated. Similar data should be integrated together. Care must be taken to start the integration process only after all the required datasets are present in the landing zone, since source systems can be located in different geographic locations and may have to be extracted in different time zones.

3.6 Data cleansing

This section focuses on cleansing the validated data. Data cleansing is a process that makes sure the quality of data in a warehouse is maintained. It is also defined as a process which helps to maintain complete, consistent, correct, unambiguous and accurate data [2]. These attributes define the quality of data and are explained below.

Complete: Complete means data is defined for each instance, without any NULL values, and records are not lost in the information flow. For example, at the Social Security Administration office, an individual's record with no SSN would be incomplete.

Consistent: The definition, value and format of the data must be the same throughout the warehouse. For example, California State University, Sacramento is known by several names: Sac State, CSUS, Cal Univ. Sacramento, etc. To make this consistent, only one convention should be followed everywhere.

Correct: The value should be true and meaningful. For example, age cannot be negative. Another example: if a pallet contains 4 items, then the same should be reflected in the warehouse.


Unambiguous: The data can have only one meaning. For example, there are several cities in the U.S. by the name New Hope, but there is only one city by that name in Pennsylvania. This unambiguous data should be loaded into the warehouse for clarity.

Accurate: Accurate means that the data loaded into the warehouse should be complete, consistent, correct and unambiguous, and must be derived or calculated with precision.

There are several data-quality checks which can be enforced; a small SQL sketch of column checks follows the three checks described below.

1) Column check: This check ensures that incoming data contains the expected values from the source system's perspective. Some of the column property checks are: checking for NULL values in non-nullable columns, checking fields for unexpected lengths, checking for numeric values which fall outside a range, checking for fields which contain values other than what is expected, etc.

2) Structure check: Column checks focus on individual columns, but structure checks focus on the relationships between those columns. Structure checks can be enforced by having proper primary keys and foreign keys so that the data obeys referential integrity. Structure checks also enforce parent-child relationships.

3) Data and value check: Data and value checks can be simple or complex. An example of a simple check: if a customer flies in business class, then he gets double the number of points in his account compared to a customer who flies economy class. An example of a complex check: a customer cannot be in a limited partnership with firm A and a member of the board of directors of firm B.
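As a minimal sketch of the column checks above (land_customer and its columns are hypothetical names), counts of the rejected values can be computed in one pass:

    SELECT SUM(customer_name IS NULL)          AS null_names,             -- NULLs in a non-nullable column
           SUM(age < 0 OR age > 120)           AS out_of_range_ages,      -- numeric range check
           SUM(CHAR_LENGTH(phone_number) > 15) AS overlong_phone_numbers  -- unexpected field length
    FROM   land_customer;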


3.7 Data transformations

This section focuses on transforming the cleansed data to load it into the warehouse. These transformations are applied based on the business requirements, the data warehouse loading approach (top-down or bottom-up) and the source-to-target mapping document.

A transformation operation takes input data, modifies it by applying one or more functions and returns the output data. This could lead to a change in data value, a change of data type or a change of data structure by addition or deletion of data. When multiple functions are applied, the intermediate data are called data sets.

After data undergoes transformation according to the requirements, it is a better approach to save it temporarily in a database before finally loading it into the warehouse. This temporary area of storage is called the staging zone. In the last step, the data load phase, if the load from the staging area into the warehouse fails, then only the load process needs to be restarted, avoiding going through the cleansing and transformation processes again.

Figure 9 Overview of data transformation

Described below are a few transformation operations. For a comprehensive list of transformation operations, please refer to the online courseware.


3.7.1 Surrogate key generator operation:

A surrogate key is a number that uniquely identifies a record in a database table and is different from a primary key. A surrogate key is not derived from the application data, but the primary key is.

A surrogate key generator operation takes one input, adds a new column which contains a surrogate key for each record and outputs the result dataset. For each input record, the surrogate key is calculated based on 4 parameters: initial value, current value, increment value and final value. If it is the first record in the dataset, then the surrogate key generator assigns the initial value to it. If it is not the first record, then the surrogate key generator adds the increment value to the current value and stores the result in the record. Usually the current value is stored in a flat file or in a database table.
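A minimal sketch of this in MySQL, assuming hypothetical staging tables stg_store (input) and stg_store_keyed (output) and a small control table etl_key_control that holds the current value:

    -- Read the stored current value and advance it by an increment of 1
    -- for every input record.
    SET @current_value := (SELECT current_value
                           FROM   etl_key_control
                           WHERE  table_name = 'store_dim');

    INSERT INTO stg_store_keyed (surr_key, store_id, store_city)
    SELECT (@current_value := @current_value + 1) AS surr_key,
           store_id,
           store_city
    FROM   stg_store;

    -- Persist the new current value for the next run.
    UPDATE etl_key_control
    SET    current_value = @current_value
    WHERE  table_name = 'store_dim';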

3.7.2 Lookup operation

The lookup operation has an input dataset, a reference dataset and an output/final dataset, as shown in Figure 10.

Figure 10 Lookup operation


The lookup operation fetches the fields specified by the user, looks for a match in the reference dataset and returns the joined records if they match; otherwise it returns NULL values in place of the reference columns.
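In SQL terms the lookup behaves like a left outer join. A sketch, with a hypothetical input dataset stg_sales and reference dataset ref_store:

    SELECT s.sale_id,
           s.store_id,
           r.store_city,   -- NULL when no match is found in the reference
           r.store_state
    FROM   stg_sales s
    LEFT JOIN ref_store r
           ON r.store_id = s.store_id;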

3.7.3 Merge operations

A merge operation can have one or more input datasets but only one output dataset. It combines the data from the input datasets into the output. The criteria for using the merge operation are that the number of fields in all input datasets should be the same and the data types of the corresponding fields should match. The result dataset will have the same number of fields and the same data types as its input datasets. The number of records in the output dataset is the total of all input datasets together. A union-style sketch follows Figure 11.

Figure 11 Merge operation
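In SQL terms the merge corresponds to a union of compatible datasets. A sketch, assuming hypothetical staging tables stg_sales_east and stg_sales_west with identical field lists and data types:

    SELECT sale_id, store_id, sale_amount FROM stg_sales_east
    UNION ALL                          -- keeps every record from both inputs
    SELECT sale_id, store_id, sale_amount FROM stg_sales_west;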


3.7.4 Aggregation operation

The aggregation operation takes a single input and produces a single output. It classifies records from the input dataset into groups and computes totals, minimums, maximums or other aggregate functions for each group, outputting them to the output dataset. The fields to group by, and the fields to use for the aggregate function calculations, must be specified by the user. A sketch follows the list of example functions below.

Below are a few examples of aggregate functions which commercial ETL tools provide nowadays:

Maximum value
Minimum value
Mean
Percentage coefficient of variation
Standard deviation
Sum of weights
Sum
Missing values count
Range
Variance
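A sketch of the aggregation operation in SQL, grouping a hypothetical staging table stg_sales by store and computing a few of the functions listed above:

    SELECT store_id,
           SUM(sale_amount) AS total_sales,
           MIN(sale_amount) AS smallest_sale,
           MAX(sale_amount) AS largest_sale,
           AVG(sale_amount) AS average_sale
    FROM   stg_sales
    GROUP BY store_id;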

3.7.5 Change capture operation

The change capture operation takes two datasets as input, makes a record of the differences and produces one output dataset. The input datasets are denoted as the before and after datasets. The output dataset contains records which represent the changes made to the before dataset to obtain the after dataset. The comparison is based on a set of key fields; records from the two datasets are assumed to be copies of one another if they have the same values in these key columns. The output dataset has an extra column which denotes insert, delete, copy or edit. These terms are explained in the change apply operation section.
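A sketch of change capture keyed on store_id, assuming hypothetical before and after snapshots dim_store_before and stg_store_after; deletes (keys present only in the before dataset) would need a second pass and are omitted here:

    SELECT a.store_id,
           a.store_city,
           CASE
               WHEN b.store_id IS NULL           THEN 'insert' -- key only in the after dataset
               WHEN a.store_city <> b.store_city THEN 'edit'   -- key matches, value changed
               ELSE 'copy'                                     -- key matches, value unchanged
           END AS change_code
    FROM   stg_store_after a
    LEFT JOIN dim_store_before b
           ON b.store_id = a.store_id;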

3.7.6 Change apply operation

The change apply operation can be used only after the change capture operation. It takes the change dataset, which is the resultant dataset from the change capture operation, and applies the encoded change operations to the before dataset to compute an after dataset. The encoded change operations are described below:

Insert: The change record is copied to the output.

Delete: The value columns of the before and change records are compared. If the value columns are the same, or if Check Value Columns on Delete is specified as False, the change and before records are both discarded; no record is transferred to the output. If the value columns are not the same, the before record is copied to the output.

Edit: The change record is copied to the output; the before record is discarded.

Copy: The change record is discarded. The before record is copied to the output.


3.7.7 Data type operation

Data type operations change the data type, precision or format of the input dataset.

Data type conversions: Conversions from text to date, text to timestamp, text to number, date to timestamp and decimal to integer are a few examples.

Precision conversions: Changing the numeric precision, say from the decimal value 3.12345 to the decimal value 3.12, is an example of this conversion.

Format conversions: Changing date and timestamp formats is one example of this conversion. If the input is 28/01/90, then based on the business requirement this could be changed to 1990-Jan-28.
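A few of these conversions sketched in MySQL, against a hypothetical staging table stg_orders:

    SELECT STR_TO_DATE(order_date_text, '%d/%m/%y')      AS order_date,      -- text to date
           CAST(order_amount_text AS DECIMAL(10,2))      AS order_amount,    -- text to number
           ROUND(unit_price, 2)                          AS unit_price_2dp,  -- precision change
           DATE_FORMAT(STR_TO_DATE(order_date_text, '%d/%m/%y'),
                       '%Y-%b-%d')                       AS formatted_date   -- e.g. 1990-Jan-28
    FROM   stg_orders;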

Compare operations: The compare operation takes two inputs and produces a single output. It performs a column-by-column comparison of the records. This can be applied to numeric and alphanumeric fields. The output dataset contains three columns. The first is the result column, which contains a code giving the result of the compare operation. The second column contains the columns of the first input link, and the third column contains the columns of the second input link. The result column usually has numeric codes which denote whether both inputs are equal, the first is an empty value, the second is an empty value, the first is greater or the first is lesser.


3.8 Data load

This section focuses on the last step of the ETL process, which is loading the validated, integrated, cleansed and transformed data into the data warehouse.

As mentioned in the previous section, it is a better approach to stage the ready-to-load data in temporary tables in a database. If the load from the staging area into the warehouse fails, then only the load process needs to be restarted, avoiding going through the cleansing and transformation processes again.

Basically, there are two types of data load: the historic load and the incremental load.

3.8.1 Historic load

A data warehouse contains historical data. Based on user requirements, data in the warehouse has to be retained for a particular duration of time. This duration could be anywhere from a single year to several decades.

When a data warehouse is created, i.e., when its tables are created, it contains no records, since planning for the creation of a warehouse could take several months or years. During this time a lot of data accumulates in the OLTP systems which act as source systems for the warehouse. Loading this initial historical data into the warehouse is the initial historic load.

Sometimes, when data is being loaded regularly into the warehouse, the ETL process might break, and fixing it could take several hours to several days. During this fix time, data continues to accumulate in the OLTP systems. Loading this data is also a historic load.


3.8.2 Incremental load

The incremental load is the periodic load of data into the warehouse. This process loads the most recent data from the OLTP systems and runs periodically until the end of the warehouse's life. Incremental loads could run daily, weekly, fortnightly, monthly, quarterly, yearly or at a scheduled time. For every incremental load there is a load window within which the ETL load process should start and finish loading into the target warehouse. After the end of the load window, business users will usually start querying and analyzing the data in the warehouse.

There are several operations involved in loading the warehouse. Based on the type of table being loaded, fact or dimension, the appropriate operation is selected. A few of the data load operations are described below, followed by a short sketch of an upsert-style load:

Insert operation: This operation inserts data into the warehouse. If the data already exists in the table, this operation will fail. Hence the target table should be checked ahead of time before executing this operation.

Update operation: This operation updates the existing records in the warehouse. Unlike the insert operation, it does not fail if the records to update are not found.

Upsert operation: This operation first executes an update operation and, if that fails, inserts the records into the warehouse.

Insert update operation: This operation first executes an insert operation and, if that fails, updates the existing records in the warehouse. The insert update operation is preferred to the upsert operation since it is more efficient [7].


Delete insert operation: This operation first executes a delete operation and then inserts the source records into the warehouse.

Bulk load operation: Bulk load is a utility provided by major ETL vendors these days which is faster and more efficient at loading huge amounts (hundreds of millions of records) of data into the warehouse [4][7].
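A sketch of an upsert-style load in MySQL using INSERT ... ON DUPLICATE KEY UPDATE, assuming a hypothetical staging table stg_store and target table store_dim with a unique key on store_id:

    INSERT INTO store_dim (store_id, store_city, store_state)
    SELECT store_id, store_city, store_state
    FROM   stg_store
    ON DUPLICATE KEY UPDATE
        store_city  = VALUES(store_city),   -- update path when the key already exists
        store_state = VALUES(store_state);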

3.8.3 Loading dimension tables

A table that stores business-related attributes and provides the context for the fact table is called a dimension table. Dimension tables are in denormalized form.

A dimension table contains a surrogate key, which is a meaningless incrementing integer value. Surrogate key values are generated and inserted by the ETL process along with the other dimension attributes. The surrogate key is made the primary key of the dimension table and is used to join with records in the fact table. By definition a surrogate key is supposed to be meaningless. However, it can be made meaningful by creating the surrogate key value by intelligently combining data from other attributes in the dimension table. But this leads to more ETL processing, maintenance and updates if the actual attributes on which these keys are based change.

In addition to the surrogate key, a dimension table also contains a natural key. Unlike the surrogate key, the natural key is derived from meaningful application data. The dimension table also contains other attributes.

A slowly changing dimension (SCD) table is a dimension table which has updates coming in for its existing records. According to user requirements, older values in dimension tables can be historically maintained or discarded. Based on this, there are three types of slowly changing dimensions. Loading data into a dimension table differs based on the type of dimension table. Below is a detailed explanation of each type and how to load it.

3.8.3.1 Type 1 Slowly Changing Dimension

For an existing record in dimension table, if there is an update on any or all

attributes from source systems, SCD Type 1 approach is to overwrite the existing record

without saving old values. This approach is used when already inserted record in

dimension table is incorrect and needs to be correct. Or when business users don't see any

use in keep history of previous values.

If the record does not exist, then a new surrogate key value is generated, appended to the dimension attributes and inserted into the table.

If the record does exist, then the surrogate key of the existing dimension record is fetched and appended to the new source record, the old record is deleted and the new record is inserted into the dimension table.

Upsert, insert update or delete insert operations can be used in this scenario. However, care has to be taken to preserve the existing surrogate key value when using the delete insert operation.

If there is a large number of Type 1 changes, the best approach is to prepare the new dimension records in a new table, drop the existing records in the dimension table, and then use the bulk load operation.


Surr_Key Store_id Store_city Store_state Store_country

384729478 37287 Sacramento California United States

Table 1 Before snapshot of Store dimension table for Type 1 SCD

Surr_Key Store_id Store_city Store_state Store_country

384729478 37287 Los Angeles California United States

Table 2 After snapshot of Store dimension table for Type 1 SCD

Note: In the new snapshot the store_city field is updated without changing the surrogate key value.
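A minimal SQL sketch of this Type 1 overwrite, assuming the store_dim table sketched earlier, is:

-- Overwrite the changed attribute in place; the surrogate key is untouched
UPDATE store_dim
SET    store_city = 'Los Angeles'
WHERE  store_id   = 37287;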

3.8.3.2 Type 2 Slowly Changing Dimension

In this approach, if there are any changes to dimension attributes for existing records in the dimension table, the old values are preserved. When an existing record needs to be changed, instead of overwriting it, a new record with a new surrogate key is generated and inserted into the dimension table. This new surrogate key is used in the fact table from that moment onwards. There is no need to change or update existing records in the fact or dimension table.

Type 2 SCD requires a good change capture system in place to detect changes in the source systems and notify the ETL system. Sometimes update notifications are not propagated from the source to the ETL system. In that case the ETL code has to download the complete dimension and make a field-by-field, record-by-record comparison to detect updates. If the dimension table has millions of records and over 100 fields, this is a very time-consuming process and the ETL code cannot complete within the specified load window. To overcome this problem, CRC codes are associated with each record. The entire record is given as input to a CRC function, which calculates a long integer code. This integer code changes even if a single character in the input record changes. When CRC codes are associated with each record, only these codes are compared instead of performing a field-by-field, record-by-record comparison.
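As an illustrative sketch (not the courseware's own code), MySQL's CRC32 function applied to a concatenation of all fields lets a single integer comparison detect a changed record; store_staging and the crc_code column are assumed names:

-- Flag records whose content no longer matches the stored CRC code
SELECT s.store_id
FROM   store_staging s
JOIN   store_dim d ON d.store_id = s.store_id
WHERE  CRC32(CONCAT_WS('|', s.store_city, s.store_state, s.store_country)) <> d.crc_code;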

To implement this in ETL code, there are two flows. One flow carries the new record and the other carries the old record from the dimension table. For the new record, a new surrogate key is generated along with the current flag or start and end date values, and this record is used in an insert operation. For the old record, the current flag or start and end date values are updated and the record is used in an update operation.

There are two ways to implement SCD Type 2:

1) Method 1: In the first method, a new flag column is added to the dimension table which indicates whether the record is current or not. In the example below, when a new record with surrogate key 384729479 is added, its current flag is inserted as "Y" and the current flag of the record with surrogate key 384729478 is set to "N". The same applies to snapshot 3. A SQL sketch of this method follows the snapshots below.

Snapshot 1 of a Store dimension table:


Surr_Key Store_id Store_city Store_state Store_country Current flag
384729478 37287 Sacramento California United States Y

Table 3 Snapshot 1 of Store dimension table for Type 2 SCD (Method 1)

Snapshot 2:

Surr_Key Store_id Store_city Store_state Store_country Current flag
384729478 37287 Sacramento California United States N
384729479 37287 Los Angeles California United States Y

Table 4 Snapshot 2 of Store dimension table for Type 2 SCD (Method 1)

Snapshot 3:

Surr_Key Store_id Store_city Store_state Store_country Current flag
384729478 37287 Sacramento California United States N
384729479 37287 Los Angeles California United States N
384729521 37287 Arlington Texas United States Y

Table 5 Snapshot 3 of Store dimension table for Type 2 SCD (Method 1)
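A SQL sketch of Method 1, using the values from the snapshots above and assuming the flag column is named current_flag, is:

-- Expire the currently active version
UPDATE store_dim
SET    current_flag = 'N'
WHERE  store_id = 37287 AND current_flag = 'Y';

-- Insert the new version and mark it as current
INSERT INTO store_dim (surr_key, store_id, store_city, store_state, store_country, current_flag)
VALUES (384729479, 37287, 'Los Angeles', 'California', 'United States', 'Y');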

2) Method 2: In the second method, two columns are added to the dimension table, namely start date and end date. The start date holds the date when the record was inserted, and the end date holds either a high date (12/31/9999) or a Current_date - 1 value. As the example below shows, in snapshot 2, when a new record with a new surrogate key is inserted, its end date is the high date and its start date is the current date. The previous record's end date is updated to Current_date - 1. The same applies to snapshot 3. To find the latest record using Method 2, a query is run against the dimension table with the condition end_date = '12/31/9999'. A SQL sketch of this method follows the snapshots below.

Snapshot 1 of a Store dimension table:

Surr_Key Store_id Store_city Store_state Store_country  Start date End date

384729478 37287 Sacramento California United States 10/1/2010 12/31/9999

Table 6 Snapshot 1 of Store dimension table for Type 2 SCD (Method 2)

Snapshot 2:

Surr_Key Store_id Store_city Store_state Store_country  Start date End date

384729478 37287 Sacramento California United States 10/1/2010 10/20/2010

384729479 37287 Los Angeles California United States 10/21/2010 12/31/9999

Table 7 Snapshot 2 of Store dimension table for Type 2 SCD (Method 2)

Snapshot 3:

Surr_Key Store_id Store_city Store_state Store_country  Start date End date

384729478 37287 Sacramento California United States 10/1/2010 10/20/2010


384729479 37287 Los Angeles California United States 10/21/2010 11/01/2010

384729521 37287 Arlington Texas United States 11/02/2010 12/31/9999

Table 8 Snapshot 3 of Store dimension table for Type 2 SCD (Method 2)
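A SQL sketch of Method 2, using the values from the snapshots above (the high date 12/31/9999 is written in MySQL's YYYY-MM-DD format), is:

-- Close out the currently active version with Current_date - 1
UPDATE store_dim
SET    end_date = '2010-10-20'
WHERE  store_id = 37287 AND end_date = '9999-12-31';

-- Insert the new version with the high end date
INSERT INTO store_dim (surr_key, store_id, store_city, store_state, store_country, start_date, end_date)
VALUES (384729479, 37287, 'Los Angeles', 'California', 'United States', '2010-10-21', '9999-12-31');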

Please refer to the online courseware for loading a Type 3 SCD table.

3.8.4 Loading fact tables

Fact tables contain measurements or metrics of business processes of an

organization. According to Ralph Kimball [5], "measurement is an amount determined by

observation with an instrument or a scale".

Fact tables are defined by their grain. A grain represents the most atomic level by

which facts are defined. For example, in a sales fact table, the grain could be an

individual line item of a sales receipt.

Fact tables contain one or more measurements along with a set of foreign keys which point to dimension tables. Dimension tables are built around fact tables to provide context to the measurements present in the fact tables. A fact table alone, without any dimension tables surrounding it, makes no business sense.

Each fact table has a primary key, which is a chosen field or group of fields. The primary key of a fact table should be defined carefully so that duplicates do not occur during the load process. Duplicates can occur when insufficient attention is paid during the fact table's design or when unexpected values start flowing in from the source systems. When this happens, there is no way to tell those records apart. To avoid this, it is a good approach to include a unique key sequence with each fact record insert.

To insert source records into the fact table, each natural dimension key must be replaced by the latest surrogate key. The surrogate key for a natural key value can always be found in the respective dimension table. Surrogate keys are looked up using the lookup operation defined in the Transformations chapter.
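A hedged sketch of this surrogate key lookup during a fact load, with sales_staging and sales_fact as assumed names and the Method 2 end date used to pick the latest dimension version, is:

-- Replace the natural store_id with the current surrogate key while inserting facts
INSERT INTO sales_fact (store_surr_key, sale_date, amount)
SELECT d.surr_key, s.sale_date, s.amount
FROM   sales_staging s
JOIN   store_dim d
  ON   d.store_id = s.store_id
 AND   d.end_date = '9999-12-31';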

Below are a few points to keep in mind when loading a fact table, which can improve load performance:

1) Insert and update records should be loaded separately. This can be done by writing the temporary update and insert data into different datasets in the staging area and creating two separate ETL flows to load these records into the fact table. Many vendor ETL tools provide upsert and/or insert-update options; in this scenario, the insert-update option works efficiently.

2) Avoid SQL insert statements and use a bulk load utility, if available, to insert huge numbers of records efficiently, thus improving the performance of the ETL code.

3) Load data in parallel if there are no dependencies between ETL flows. For example, if two tables being loaded do not have a parent-child relationship, then the ETL code to load both can be started simultaneously. Many vendor ETL tools provide partitioning and pipelining mechanisms to load data in parallel. In the partitioning mechanism, a huge dataset is partitioned and several processes are created to work on the partitions in parallel. In the pipelining mechanism, after a process finishes processing a part of a huge dataset, it passes the processed chunk to the next stage in the ETL code. These two mechanisms speed up ETL processing significantly.

4) Disable rollback logging for the databases which house data warehouse tables. Rollback logging is best suited for OLTP applications, which require recovery from uncommitted transaction failures, but for OLAP applications it consumes extra memory and CPU cycles since all data is entered and managed by the ETL process [4].

5) Temporarily disabling indexes on fact tables while loading data and re-enabling them when the load is complete is a great performance enhancer (a SQL sketch follows at the end of this list). Another option is to drop unnecessary indexes and rebuild only the required ones.

6) Partitioning very big fact tables improves users' query performance. A table and its index can be physically divided and stored on separate disks. By doing so, a query that requires a month of data from a table with millions of records can fetch data directly from that particular physical disk without scanning other data [4].

A few of the above steps can be planned ahead of time and implemented early, saving time on code rework at a later stage in the project.
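As a minimal MySQL sketch of points 2 and 5 above (the sales_fact table and file path are assumptions; DISABLE KEYS affects non-unique indexes on MyISAM tables and may be ignored by other storage engines):

ALTER TABLE sales_fact DISABLE KEYS;        -- switch off index maintenance during the load

LOAD DATA INFILE '/staging/sales_fact.csv'
INTO TABLE sales_fact
FIELDS TERMINATED BY ',';

ALTER TABLE sales_fact ENABLE KEYS;         -- rebuild the indexes once, after the load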

3.9 Exception handling

This chapter discusses exception handling in the ETL process. An exception in ETL is defined as any abnormal termination, unacceptable event or incorrect data which stops the ETL flow and thus prevents data from reaching the data warehouse. Exception handling is the process of handling these exceptions without corrupting any existing committed data, terminating the ETL process gracefully. During the ETL process, exceptions occur either due to incorrect data or due to infrastructure issues.

Data-related exceptions can be caused by incorrect data formats, incorrect values or incomplete data from the source systems. These records need to be captured as rejects, either in a file or in a database table, and should be corrected, reprocessed and inserted in the next ETL run.
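One possible way to capture such rejects in a database table is sketched below; the table and column names are assumptions, not part of the project:

CREATE TABLE etl_rejects (
    reject_id     BIGINT NOT NULL AUTO_INCREMENT,
    source_name   VARCHAR(100),                          -- source system or file that produced the record
    record_text   TEXT,                                  -- the raw rejected record
    reject_reason VARCHAR(255),
    reject_time   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (reject_id)
);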

Infrastructure exceptions can be caused by hardware, network, database, operating system or other software issues. In such scenarios, when ETL jobs fail, care must be taken to make them restartable. Making the jobs restartable means that when ETL jobs are restarted they do not insert duplicate data or abort because of existing data.

Exception handling should be built into the extraction, validation and integration, cleansing, transformation and data load phases in order to have stable and efficient ETL code in place.

This chapter discussed the courseware. Several components of the ETL process which are important for every ETL project implementation were covered. Any user who wishes to participate in the implementation of ETL projects should know about these components and should include appropriate exception handling mechanisms for successful implementations. In the next chapter the architecture of the ETL web tool created for this project is discussed, along with its various components and how they interact with each other.


Chapter 4

ETL TOOL ARCHITECTURE


This chapter describes the architecture of the tool implemented in this project.

Initially the various layers, their interaction with each other and how they integrate to

form a system are discussed. Then the components of the tool are discussed in detail.

The ETL tool is composed of three layers: the client layer, the processing layer

and the database layer.

Figure 12 ETL tool layers

The client layer is the layer visible to the end user and is used to interact with the system. In the processing layer, the processing of the user input takes place. This layer has source connectors for text files and the MySQL database. The database layer has the target connector for the MySQL database and uses it to connect to the target MySQL database in order to insert records.

Users use a web browser, such as Microsoft Internet Explorer, to control the ETL tool. They can specify the number of sources, the type of each source, the transformations to be applied to the extracted data and the target MySQL table connection details.


The processing layer collects the user input from the client layer. It processes the user request in the background and displays success or error messages to the user. This layer has MySQL and text file connectors which connect to the source, based on the user input, in order to extract data. Once the data is extracted, it is staged in temporary MySQL tables or flat files so that transformations can be applied to it without disturbing the source content.

The database layer has the target MySQL connector, which connects to the target MySQL table and inserts the transformed data.

Figure 13 shows the various components of the tool and how they are connected with the layers.


Figure 13 Layers and components of ETL tool

Users must use a web browser to access the tool and courseware. There are two types of users: guest and registered user. A guest user is one who is not a student, faculty or staff member of California State University, Sacramento. Guest users don’t need a username or password to access the tool or courseware, but they have limited access to the tool. They can only select sample source text files or sample database tables and write only to sample target database tables. The options to enter database and file details are disabled for them. On the other hand, a registered user is one who is a student, faculty or staff member of California State University, Sacramento. Registered users require a username and password to log in and get unrestricted access to the tool. They can specify custom absolute file paths and custom MySQL tables as sources. These tables can be in any database located within the California State University, Sacramento campus and should be accessible via the campus LAN. They can also specify a custom target MySQL table to load their source data.

When users, guest or registered, open the homepage they have the option to go either to the courseware or to the tool. When they click on the tool link, the first page they see is the login page. If they don’t have a username and password they can click on the guest link to access the tool.

Once inside the tool they go through the extract, transform and load pages in that order. On the extract page, they can select either a text file or a MySQL table as the source and enter details such as the absolute path or the database details. When they click on Continue, the input details are passed on to the processing layer as shown in Figure 13.

The processing layer reads the input details and uses source connectors to validate the input data. If the user specifies text as the source, then the text source connector checks that the file exists and is readable before proceeding to the next step. If the user specifies a MySQL table, then the MySQL source connector connects to the source database and makes sure it has table read privileges before proceeding further. Once the validation is successful the processing layer copies the source data over to the landing zone. A landing zone is a temporary work area where the source data is landed so that it can be processed by the tool. The reason for having a landing zone is that the source data remains untouched while the tool has complete access to manage and modify the files in the landing zone.


The next step is to define metadata for the selected source. If the selected source is text, then the processing layer displays a webpage for defining the metadata of individual columns. If the source is a MySQL table, then the processing layer automatically fetches the table’s metadata from the database dictionary.

Once the metadata for the source is defined, the user chooses the transformations which need to be applied to the extracted source data. Applying the transformations is also taken care of by the processing layer with the help of PHP and Korn shell scripts. The transformed data is stored temporarily in files or MySQL tables, and this zone is called the staging zone.

Data from the staging zone is picked up by the loading layer and is loaded into

target MySQL tables using the target connectors.

This chapter gave an overview of the architecture, layers and components of the tool. It also discussed the system design. Once the reader is familiar with the design and architecture, it is easier to understand the detailed implementation of the tool in Chapter 5.

Chapter 5

ETL TOOL IMPLEMENTATION


This chapter describes the implementation of the ETL web tool. It discusses the guest and registered user login processes. Based on the type of user, several features may or may not be available. All options available in each phase are explained, along with screenshots and examples for each option.

5.1 Using the tool

Users must use a web browser to use the ETL tool. There are two types of users: guest and registered user. A guest is anyone who is not a student, faculty or staff member of California State University, Sacramento. Guests don’t need a username or password to use the tool; they click on the hyperlink “Guest? Click here”. When using the guest credentials they have limited access to source files, source tables and target tables. Four sample files, along with their absolute paths, and four sample tables are provided in the tool online. Guests cannot enter source table or target table details, as these options are disabled in the tool.

A registered user is one who is a student, faculty or staff member of California State University, Sacramento and has been supplied a username and password by the professor. Registered users have unrestricted access, unlike guest users. They can enter the absolute paths of the source files which they want to load. They can enter credentials such as server name, database name, table name, username and password of different MySQL source and target databases, which can be anywhere on campus and available via the college LAN.

New users can be added or existing users can be deleted by the professor by logging in as administrator. Usernames and passwords are stored in the Users MySQL table in the web_etl database. This database can be accessed only by the administrator. The structure of the Users MySQL table is given below.

Username Varchar(50) Primary Key

Password Varchar(50) Not null

Table 9 Table structure to store usernames and password

From Table 9, Username is of the varchar datatype with length 50 and is the primary key of the table. The Password field is of the varchar datatype with length 50 and has a NOT NULL constraint. Each username entered must be unique in this table.
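The structure in Table 9 corresponds to a MySQL definition along the following lines:

CREATE TABLE Users (
    Username VARCHAR(50) NOT NULL,
    Password VARCHAR(50) NOT NULL,
    PRIMARY KEY (Username)
);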

Figure 14 shows the flow the administrator has to follow in order to add or delete users from the system.

Figure 14 Add or delete users

The administrator opens the homepage, clicks on the ETL tool link, and then enters the administrator username and password. Once inside the system, the administrator can add new users or delete existing users.

5.2 Extraction phase


Users must use a web browser to use the ETL tool. The first page displays the extraction phase of the tool. Here the user can select the number of sources and the type of each source. The minimum number of sources is one and the maximum is two. The type of source can be a flat file and/or a MySQL table. Below is the flow of how users can select a text file or MySQL table as a source.

Figure 15 Source selection

Users open the homepage, choose the ETL tool, log in with their username and password or choose the guest login, and then choose the number of sources and their types. Both options are explained in detail below.

5.2.1 Text file as source

If the user is a guest user then he/she has limited options: they can only choose from the list of sample files displayed on the webpage. If the user is a registered user then he/she can specify a custom absolute path name.


If the user chooses a text file as one of the sources then he/she has to specify the absolute path of the file on the Linux host. The following rules apply:

1. The file should exist and should have read permissions.

2. The file should be in ASCII format.

3. It should be comma delimited without a final delimiter.

4. It should not use quotes or double quotes around the fields.

5. It can have a maximum of ten fields.

When the user enters the absolute path of the source file and clicks Next, the following checks are performed by the validateSourceFile.ksh script:

1. Check whether the user has entered the file path in the input box. If yes, proceed to step 2, else raise an error.

2. Check whether the entered path is a file or a directory. If it is a file, proceed to step 3, else raise an error.

3. Check whether the file has read permissions. If yes, proceed to step 4, else raise an error.

4. Check whether the file size is zero. If yes, the file is empty, so raise an error; otherwise proceed to the defineMetaData.php page to define the metadata of the source file.

Below is the code snippet of validateSourceFile.ksh

## Check for number of arguments to this program
## 1st argument is the filename
if [[ $# -ne 1 ]]; then
    echo "Error: Name of the source file not entered"
    exit 1
fi

## Assign 1st argument to var sourceFile
sourceFile=$1

## Check if var sourceFile is empty
if [[ -z $sourceFile ]]; then
    echo "Error: Filename supplied is empty"
    exit 1
fi

## Check if var sourceFile is a directory
if [[ -d $sourceFile ]]; then
    echo "Error: Source file path supplied is a directory. Please specify a file"
    exit 1
fi

## Check for file permissions
if [[ ! ( -r $sourceFile ) ]]; then
    echo "Error: Source file path is incorrect or source file doesn't have read permissions"
    exit 1
fi

## Check for empty file
if [[ ! ( -s $sourceFile ) ]]; then
    echo "Error: Source file path is incorrect or source file is empty"
    exit 1
fi

Once the source file is validated by validateSourceFile.ksh, the next step is to define the metadata on the defineMetaData.php page. The ETL tool allows up to ten fields to be defined for a source text file. The metadata consists of the field name, its data type and its length. The defineMetaData.php page displays a snapshot of 10 lines of the source file so that the user can refer to it while defining the metadata. Below is a screenshot of the web page which allows users to define metadata for the source file.


Figure 16 Screenshot of Define Metadata page

The first field is the name field. This is an input box where users have to type in the field name. The fields in the source must be named on this page, and the source file should not contain field names as its first record.

The second field is the data type field. It is a drop down box and only one option can be

selected. It has the following values:

Varchar

Char

Integer

Date

Timestamp

Decimal


The third field is an input box where the length of the field has to be specified by the user. It should contain only numeric values greater than zero. This field is dynamic and can display length, precision or nothing, based on the data type chosen by the user. The table below shows each data type and the display that appears on the webpage.

Data type selected Display

Varchar Length input box

Char Length input box

Integer Length input box

Date No display

Timestamp No display

Decimal Precision and Scale input boxes

Table 10 Type of input box based on data type

Below is a snippet of HTML and JavaScript which dynamically changes the length field based on the data type selected.

<script type="text/javascript">
$(document).ready(function(){

    $("#bloc1").change(function() {
        if ($("#bloc1").val() == 'varchar' || $("#bloc1").val() == 'char' ||
            $("#bloc1").val() == 'integer') {
            $(".col1").show();
            $(".col1_d").hide();
            $(".col1_t").hide();
        } else if ($("#bloc1").val() == 'date' || $("#bloc1").val() == 'timestamp') {
            $(".col1").hide();
            $(".col1_d").hide();
            $(".col1_t").show();
        } else {
            $(".col1").hide();
            $(".col1_d").show();
            $(".col1_t").hide();
        }
    });
    $("#bloc1").change();
}
...
</script>
..
<tr>
    <td>Column1: Name <input type="text" name="col1_name">
    <select id="bloc1" name="col1_select">
        <option SELECTED value="0"></option>
        <option value="varchar">varchar</option>
        <option value="char">char</option>
        <option value="integer">integer</option>
        <option value="date">date</option>
        <option value="timestamp">timestamp</option>
        <option value="decimal">decimal</option>
    </select>
    <td class="col1">
        Length <input type="text" name="col1_length" >
    </td>
    <td class="col1_t"></td>
    <td class="col1_d">
        Precision <input type="text" name="col1_precision" >
        Scale <input type="text" name="col1_scale" >
    </td></td>
</tr>

Once the metadata is defined, temporary MySQL tables are created in the web_etl database to hold the source file data. Figure 17 shows the source flow.

Figure 17 Flow for landing data using text source

Internally, the PHP source code prepares a custom CREATE TABLE SQL statement from the manually defined metadata and runs it against the web_etl database. This creates a table matching the source data. The source data is then loaded into the temporary table using the load scripts.
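As a hypothetical example of the kind of SQL that could be generated from user-defined metadata (the table name, column names and file path below are invented for illustration):

CREATE TABLE web_etl.tmp_source_file (
    cust_name  VARCHAR(50),
    cust_city  VARCHAR(50),
    balance    DECIMAL(10,2),
    created_dt DATE
);

LOAD DATA LOCAL INFILE '/landing/source_file.txt'
INTO TABLE web_etl.tmp_source_file
FIELDS TERMINATED BY ',';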

5.2.2 MySQL table as source


If the user is a guest user then he/she has limited options: they can only choose from the list of sample tables displayed on the webpage. If the user is a registered user then they can specify details as described in the following sections.

If the user chooses a MySQL table as one of the sources then the following details must be specified:

1. Name of the remote server. The server should be within the college network and

accessible.

2. Name of the database in the remote server.

3. Name of the table.

4. Username

5. Password

The script source_mysql.php accepts the above input and validates the following before

proceeding further-

1. Checks the connection to the remote server

2. Validates the username and password

3. Checks if the database exists

4. Checks if the table exists.

If the above checks pass, then the source table structure and data are captured by createTempTableMySql.ksh. The user does not have to define the metadata manually. Below is the working of the MySQL table source flow:


Figure 18 Flow of landing data using MySQL table as source

After the MySQL source table details are entered, the PHP and Korn shell scripts automatically extract the metadata from the source table by connecting to that database. They also export the source data into temporary files in the landing zone. Then a temporary table is created in the web_etl database and the landing zone data is loaded into it.
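The internals of these scripts are not reproduced here; hypothetically, the statements they might issue against the remote source database look like the following (source_db.customer is an invented name):

-- Fetch the table definition so that a matching temporary table can be created in web_etl
SHOW CREATE TABLE source_db.customer;

-- Pull the rows; in practice the mysql client output would be redirected
-- into a delimited file in the landing zone
SELECT * FROM source_db.customer;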

Below is a screenshot of the web page which allows users to enter credentials for the source MySQL table.

Figure 19 Screenshot of database details webpage

5.3 Transformation phase


After selecting the sources, users must select transformations based on their business requirements. There are several transformations available; they are described below in detail. The flow is shown in Figure 20.

Figure 20 Flow of transformation phase

The available transformations depend on the number of sources selected during the extract phase. If multiple sources are selected, then merge with duplicates and merge without duplicates are the two options available. If a single source is selected, then the transformations are based on the data types of the fields in the input dataset. After the data is transformed, it is written into the staging area, ready to be loaded into the target MySQL tables.

5.3.1 Transformation for a single source

When users choose a single source, several transformations are available to them based on the data type of each field. The data type of each field is fetched from the database dictionary using the query SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = '$tablename'.

The table below shows the structure of the COLUMNS table present in the INFORMATION_SCHEMA database [9]. Of all the fields present in this table, only the COLUMN_NAME and DATA_TYPE fields are required to display the transformations available to the user.

Field                      Type                  Null
TABLE_CATALOG              varchar(512)          YES
TABLE_SCHEMA               varchar(64)           NO
TABLE_NAME                 varchar(64)           NO
COLUMN_NAME                varchar(64)           NO
ORDINAL_POSITION           bigint(21) unsigned   NO
COLUMN_DEFAULT             longtext              YES
IS_NULLABLE                varchar(3)            NO
DATA_TYPE                  varchar(64)           NO
CHARACTER_MAXIMUM_LENGTH   bigint(21) unsigned   YES
CHARACTER_OCTET_LENGTH     bigint(21) unsigned   YES
NUMERIC_PRECISION          bigint(21) unsigned   YES
NUMERIC_SCALE              bigint(21) unsigned   YES
CHARACTER_SET_NAME         varchar(32)           YES
COLLATION_NAME             varchar(32)           YES
COLUMN_TYPE                longtext              NO
COLUMN_KEY                 varchar(3)            NO
EXTRA                      varchar(27)           NO
PRIVILEGES                 varchar(80)           NO
COLUMN_COMMENT             varchar(255)          NO

Table 11 Structure of INFORMATION_SCHEMA.COLUMNS table

The different data types and the transformations available for each are described below; a consolidated SQL sketch of the corresponding MySQL functions follows the list.

Integer Data type

No transformations available for this data type

Varchar and Char Data type

Convert to lower case: This option converts the input string or characters to

lower case. Example: Input “ABC”; Output “abc”

Convert to upper case: This option converts the input string or characters to

upper case. Example: Input “Abc”; Output “ABC”

Remove leading spaces: This option removes leading spaces from the input

string or characters, if any. Example: Input “ abc ”; Output “abc ”

Remove trailing spaces: This option removes trailing spaces from the input

string or characters, if any. Example: Input “ abc ”; Output “ abc”

Remove leading and trailing spaces: This option removes leading and trailing

spaces from the input string or characters, if any. Example: Input “ abc ”;

Output “abc”

Decimal type


Round: This option converts the input decimal to an approximately equal, simpler and shorter representation. Example: Input 2.20; Output 2.00

Ceiling: This option returns the smallest integer value that is greater than or equal to the input decimal. Example: Input 2.20; Output 3

Floor: This option returns the largest integer value that is less than or equal to the input decimal. Example: Input 2.20; Output 2

Absolute value: This option converts the input decimal to its absolute value.

Example: Input -2.20 ; Output 2.20

Date and Timestamp data type

Get date: This option extracts the date part from date or timestamp input.

Example: Input “2010-01-04 14:09:02”; Output “2010-01-04”

Get day: This option extracts the day of the month from the date or timestamp input. Example: Input “2010-01-04 14:09:02”; Output “4”

Get day of the week: This option returns the weekday index from the date or

timestamp input. Returns 1 for Sunday, 2 for Monday…. 7 for Saturday. Example:

Input “2010-10-18”; Output 2

Get month: This option returns the month from the date or timestamp input.

Example: Input “2010-10-18”; Output 10

Get name of the month: This option returns the month name from the date or

timestamp input. Example: Input “2010-10-18”; Output “October”

Get quarter: This option returns the quarter from the date or timestamp input.

The returned value is between 1 and 4. Example: Input “2010-02-18”; Output 1


Get year: This option returns the year from the date or timestamp input.

Example: Input “2010-02-18”; Output 2010
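The options above map directly onto built-in MySQL functions. A consolidated sketch over assumed columns name (varchar), amount (decimal) and created_at (timestamp) in an assumed staging_table is:

SELECT LOWER(name), UPPER(name),
       LTRIM(name), RTRIM(name), TRIM(name),
       ROUND(amount), CEIL(amount), FLOOR(amount), ABS(amount),
       DATE(created_at), DAY(created_at), DAYOFWEEK(created_at),
       MONTH(created_at), MONTHNAME(created_at),
       QUARTER(created_at), YEAR(created_at)
FROM   staging_table;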

Below is a screenshot of the web page showing the transformation page when a single source is selected.

Figure 21 Screenshot showing various transformations

Below is the code from one_source.php, which dynamically displays the transformations available to the user based on the data type of the input fields.

echo "<table border='1'><tr>

<th>Column name</th><th>Datatype</th><th>Transformation</th>

</tr>";

Page 78: A Courseware on ETL Process

66

$i=1;$j=0;while($row = mysql_fetch_array($result)){

echo "<tr>";echo "<td>" . $row[0] . "</td>";echo "<td>" . $row[1] . "</td>"; echo "<input type=hidden name=ROWS value=" . $num . " />";if ($row[1] == 'decimal'){

$j=$i+1;echo "<td> <select name=". $j .">"; echo"<option value=' '> </option>"; echo"<option value=round>Round</option>"; echo"<option value=ceil>Ceiling</option>"; echo"<option value=floor>Floor</option>"; echo"<option value=abs>Absolute Value</option>"; echo" </select> </td>";echo "<input type=hidden name=".$i." value=" . $row[0] . " />";

}else if ($row[1] == 'varchar' || $row[1] == 'char'){

$j=$i+1;echo "<td> <select name=".$j."> <option value=' '> </option>";echo "<option value=lower>Convert to lower case</option>";echo "<option value=upper>Convert to upper case</option>";echo "<option value=ltrim>Remove leading spaces</option>";echo "<option value=rtrim>Remove trailing spaces</option>";echo "<option value=trim>Remove leading & trailing

spaces</option>"; echo " </select> </td>";echo "<input type=hidden name=".$i." value=" . $row[0] . " />";

}

else if ($row[1] == 'timestamp' || $row[1] == 'date'){

$j=$i+1;echo "<td> <select name=".$j."> <option value=' '> </option>";echo "<option value=date>Get date</option>";echo "<option value=day>Get day</option>";echo "<option value=dayofweek>Get day of the week</option>";echo "<option value=month>Get month</option>";echo "<option value=monthname>Get name of the

month</option>";

Page 79: A Courseware on ETL Process

67

echo "<option value=quarter>Get quarter</option>"; echo "<option value=year>Get year</option>"; echo " </select> </td>";echo "<input type=hidden name=".$i." value=" . $row[0] . " />";

}else{

$j=$i+1;echo "<input type=hidden name=".$i." value=" . $row[0] . " />";

}echo "</tr>"; $i=$i+2;

}echo "</table>";

5.3.2 Transformation for multiple sources

When users choose two sources, they have two options for transforming the input data: one is to merge the two sources keeping duplicates, and the other is to merge them without any duplicates. Users should note that when they choose multiple sources to merge, the number of fields in both sources must be the same and the data types of the corresponding fields must also match.

Once they have selected the transformation, PHP and Korn shell scripts check whether the two datasets are compatible with each other by comparing the number of columns and the data types of the corresponding columns. If they match, the merged data is temporarily landed in the staging zone, which is a temporary MySQL table. If they don't match, an error is displayed to the user.
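Conceptually, the two options correspond to UNION ALL and UNION in SQL; staging_source1 and staging_source2 are assumed names for the two landed datasets:

-- Merge with duplicates: keep every row from both sources
SELECT * FROM staging_source1
UNION ALL
SELECT * FROM staging_source2;

-- Merge without duplicates: identical rows appear only once
SELECT * FROM staging_source1
UNION
SELECT * FROM staging_source2;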


5.4 Loading phase

After the user chooses the transformations to be applied to the source data, he/she has to choose a MySQL table into which to load the data. The transformed data is loaded directly from the staging zone into the target MySQL table, as shown in Figure 22.

Figure 22 Flow of loading phase

If the target table is in the same server and database as the staging zone table, then the data is copied directly over to the target table. If the target table is located in a different database from the staging zone table, then the data from the staging zone is exported to temporary files and then loaded into the target table using the export scripts.

The MySQL target table must already be defined in the target database. The user has to specify the following details:

Name of the remote server. The server should be within the college network and

accessible.

Name of the database in the remote server.

Name of the table.

Username

Password

The script transform_source.php accepts the above input and validates the following

before proceeding further-


1. Checks the connection to the remote server

2. Validates the username and password

3. Checks if the database exists

4. Checks if the table exists.

5. Checks if the user has insert permissions.

If the above checks are satisfied, then transform_source.php fetches the data from the staging zone and inserts the transformed data into the target table.
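For example, the generated statement could look like the following; the table and column names are illustrative only:

INSERT INTO target_db.customer_clean
SELECT UPPER(cust_name), TRIM(cust_city), ROUND(balance)
FROM   web_etl.staging_table;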

Below is a screenshot of the web page showing the transformation and load page when multiple sources are selected.

Figure 23 Screenshot showing transformations for multiple sources


Below is the code snippet from transform_source.php which generates the custom SQL to

insert the data into target table.

$ROWS = $_POST['ROWS'];   // number of columns posted from the transformation page
$ROWS = $ROWS * 2;
$i = 1;
while ($i <= $ROWS) {
    $a = $_POST[$i];      // column name (hidden input)
    $j = $i + 1;
    $b = $_POST[$j];      // transformation (MySQL function) chosen for that column
    if ($i == ($ROWS - 1)) {
        $c = $c . $b . "(" . $a . ")";    // last column: no trailing comma
    } else {
        $c = $c . $b . "(" . $a . "),";
    }
    $i = $i + 2;
}

$sql = "insert into $table select $c from table";

This chapter discussed the implementation of the ETL web tool. It covered the options available in each phase in detail, along with screenshots, descriptions of how to use them, an example for each option and important source code snippets to help the user understand the internal working of the tool. Chapter 6 is the final chapter, which gives an overall summary along with possible future enhancements to the courseware and the tool.


Chapter 6

CONCLUSION

In this project the ETL courseware and the ETL tool implementation were discussed. The ETL courseware covers the important aspects, from initial requirements gathering to the final error handling processes, which ETL developers need to know in order to implement ETL projects successfully. The ETL tool implemented in this project can extract data from multiple heterogeneous sources, combine it, apply various transformations and load the transformed data into target database tables. The ETL tool is implemented using PHP 4.3, Korn shell scripts and MySQL 5.1.4.

In conclusion, this project has accomplished its primary goals and objectives as discussed in the scope section 2.2 of Chapter 2. The main objective of the ETL courseware is to first introduce basic concepts to familiarize the interested audience with ETL and then discuss advanced topics such as cleansing, transformations, and dimension and fact table loads. The main objective of the ETL tool is to provide free access for interested audiences to learn what ETL is and how it works. The tool provides an interactive, user-friendly graphical user interface. It can extract from heterogeneous sources, land the data in the landing zone, apply various types of transformations, stage the data in the staging zone and finally load the transformed data into database tables. The heterogeneous sources can be flat files or MySQL tables. Several transformations are available to apply to the landed source data. The source and target databases must be MySQL databases and must be connected via the LAN within the California State University, Sacramento campus.

This project has helped me understand the basics of an ETL tool's internal working. It also helped me learn new languages like PHP and Korn shell scripting, and I am thankful that I got an opportunity to work with the MySQL database.

6.1 Future enhancements

There are a few limitations in this project which can be addressed in the future to enhance the ETL tool and courseware. The first limitation is the limited number of source and target connectors; currently the tool has only flat file and MySQL table connectors. Oracle database, MS SQL Server database and XML file connectors could be added. A second enhancement would be to add more transformations, such as SCD, change-capture and change-apply stages, to the ETL tool.


BIBLIOGRAPHY

[1] W.H Inmon, "Building the Data Warehouse" Fourth Edition

[2] Jack E. Olson, "Data Quality: The Accuracy Dimension"

[3] Ralph Kimball, Margy Ross, "The Data Warehouse Toolkit: The Complete Guide to

Dimensional Modeling" Second Edition

[4] Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger, "Mastering Data Warehouse

Design: Relational and Dimensional Techniques"

[5] Ralph Kimball, Joe Caserta, "The Data Warehouse ETL Toolkit: Practical Techniques

for Extracting, Cleaning, Conforming, and Delivering Data"

[6] Larissa T. Moss, Shaku Atre, "Business Intelligence Roadmap: The Complete Project

Lifecycle for Decision-Support Applications"

[7] Ralph Kimball, Margy Ross, "The Kimball Group Reader: Relentlessly Practical

Tools for Data Warehousing and Business Intelligence"

[8] Wikipedia, General Information about Data Warehouse, [Online].

Available: http://en.wikipedia.org/wiki/Data_warehouse

[9] MySQL, The INFORMATION_SCHEMA COLUMNS Table, [Online]

Available: http://dev.mysql.com/doc/refman/5.0/en/columns-table.html

[10] MySQL, Overview of MySQL, [Online]

Available: http://dev.mysql.com/doc/refman/5.1/en/what-is-mysql.html

[11] MySQL, What Is New in MySQL 5.1, [Online]

Available: http://dev.mysql.com/doc/refman/5.1/en/mysql-nutshell.html


[12] PHP, PHP features, [Online]

Available: http://php.net/manual/en/features.php

[13] Wikipedia, Korn Shell, [Online]

Available: http://en.wikipedia.org/wiki/Korn_shell

[14] Wikipedia, Comparison of Computer shells, [Online]

Available: http://en.wikipedia.org/wiki/Comparison_of_computer_shells