ETL Testing (Extract, Transform, and Load)
8/13/2019 ETL Testing (Extract, Transform, And Load)
http://slidepdf.com/reader/full/etl-testing-extract-transform-and-load 1/39
What is a data warehouse?
A data warehouse is an electronic store of an
organization's historical data, kept for the
purpose of reporting, analysis and data
mining or knowledge discovery.
Beyond that, a data warehouse can also be used for purposes such as data integration,
master data management etc.
According to Bill Inmon, a data warehouse
should be subject-oriented, non-volatile,
integrated and time-variant.
Explanatory Note
Note here, non-volatile means that data, once loaded into the warehouse, will not get deleted
later. Time-variant means the data is stored with a time context, so changes with respect to
time can be tracked.
The above definition of data warehousing is typically considered the "classical" definition.
However, if you are interested, you may want to read the article - What is a data warehouse -
A 101 guide to modern data warehousing - which opens up a broader definition of data
warehousing.
What are the benefits of a data warehouse?
A data warehouse helps to integrate data (see Data integration) and store it historically, so that
we can analyze different aspects of the business, including performance analysis, trends,
prediction etc. over a given time frame, and use the results of our analysis to improve the
efficiency of business processes.
Why is a data warehouse used?
For a long time in the past, and even today, data warehouses have been built to facilitate
reporting on the different key business processes of an organization, known as KPIs. Data
warehouses also help to integrate data from different sources and present a single point of
truth for the business measures.
A data warehouse can be further used for data mining, which helps with trend prediction,
forecasting, pattern recognition etc. Check this article to know more about data mining.
What is the difference between OLTP and OLAP?
OLTP is the transaction system that collects business data, whereas OLAP is the reporting and
analysis system on that data.
OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly
normalized. On the other hand, OLAP systems are deliberately denormalized for fast data
retrieval through SELECT operations.
Explanatory Note:
In a department store, when we pay at the check-out counter, the salesperson at the
counter keys all the data into a "point-of-sale" machine. That data is transaction data, and the
related system is an OLTP system.
On the other hand, the manager of the store might want to view a report on out-of-stock
materials, so that he can place purchase orders for them. Such a report will come out of an
OLAP system.
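As a sketch of the contrast, the following Python snippet keeps the OLTP side normalized and builds a denormalized table for OLAP-style reads. It uses an in-memory SQLite database, and all table and column names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# OLTP side: normalized tables, optimized for INSERT/UPDATE.
cur.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER)")
cur.execute("INSERT INTO product VALUES (1, 'Rice')")
cur.execute("INSERT INTO sale VALUES (100, 1, 20)")

# OLAP side: a denormalized, pre-joined table for fast SELECTs,
# the kind of structure a manager's report would read from.
cur.execute("""
    CREATE TABLE sales_report AS
    SELECT s.sale_id, p.name AS product_name, s.qty
    FROM sale s JOIN product p ON p.product_id = s.product_id
""")
row = cur.execute("SELECT product_name, qty FROM sales_report").fetchone()
print(row)  # ('Rice', 20)
```

The report query on the OLAP side needs no join at read time, which is exactly the trade-off the text describes.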
What is a data mart?
Data marts are generally designed for a single subject area. An organization may have data
pertaining to different departments like Finance, HR, Marketing etc. stored in the data
warehouse, and each department may have a separate data mart. These data marts can be
built on top of the data warehouse.
What is ER model?
An ER model, or entity-relationship model, is a particular methodology of data modeling
wherein the goal of modeling is to normalize the data by reducing redundancy. This is
different from dimensional modeling, where the main goal is to improve the data retrieval
mechanism.
What is dimensional modeling?
A dimensional model consists of dimension and fact tables. Fact tables store the different
transactional measurements, along with the foreign keys from the dimension tables that
qualify the data. The goal of the dimensional model is not to achieve a high degree of
normalization but to facilitate easy and fast data retrieval.
Ralph Kimball is one of the strongest proponents of this very popular data modeling
technique, which is used in many enterprise-level data warehouses.
If you want to read a quick and simple guide on dimensional modeling, please check our Guide
to dimensional modeling.
What is dimension?
A dimension is something that qualifies a quantity (measure).
For example, consider this: if I just say "20kg", it does not mean anything. But if I say,
"20kg of Rice (product) is sold to Ramesh (customer) on 5th April (date)", then that gives a
meaningful sense. The product, customer and date are dimensions that qualify the
measure - 20kg.
Dimensions are mutually independent. Technically speaking, a dimension is a data element that
categorizes each item in a data set into non-overlapping regions.
What is Fact?
A fact is something that is quantifiable (or measurable). Facts are typically (but not always)
numerical values that can be aggregated.
What are additive, semi-additive and non-additive measures?
Non-additive Measures
Non-additive measures are those which cannot be used inside any numeric aggregation
function (e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or
percentage, e.g. a 5% profit margin, a revenue-to-asset ratio etc. Non-numerical data can also
be a non-additive measure when stored in a fact table, e.g. some kind of varchar flag in the
fact table.
Semi Additive Measures
Semi-additive measures are those where only a subset of aggregation functions can be
applied. Take account balance: a SUM() over balances does not give a useful result, but the
MAX() or MIN() balance might be useful. Similarly, for a price rate or currency rate, a sum is
meaningless; however, an average function might be useful.
Additive Measures
Additive measures can be used with any aggregation function, like SUM(), AVG() etc. Sales
quantity is one example.
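The three classes can be illustrated with a small Python sketch; the balance, quantity and margin figures below are invented for illustration:

```python
# Hypothetical daily account balances for one account (semi-additive).
balances = [100.0, 120.0, 90.0]

# Additive measure: sales quantity can be summed across any dimension.
sales_qty = [20, 15, 5]
total_qty = sum(sales_qty)           # a meaningful total

# Semi-additive: summing balances across time is meaningless,
# but min()/max() (or the closing balance) are useful.
worst_balance = min(balances)

# Non-additive: a ratio must be recomputed from its additive parts,
# never summed or averaged directly across rows.
profit = [5.0, 3.0]
revenue = [100.0, 50.0]
margin = sum(profit) / sum(revenue)  # not the average of per-row margins

print(total_qty, worst_balance, round(margin, 4))
```

Note that averaging the per-row margins (5% and 6%) would give a different, wrong answer from the correctly recomputed overall margin.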
"Classifying data for successful modeling"
What is data?
Let us begin our discussion by defining data. Data are values of qualitative or
quantitative variables belonging to a set of items. Simply put, a datum is an attribute,
property or characteristic of an object. The point to note here is that data can be both
qualitative (brown eye color) and quantitative (20cm long).
A common way of representing or displaying a set of correlated data is through a table-type
structure comprised of rows and columns. In such structures, the columns generally
signify attributes, characteristics or features, and each row (tuple) signifies a set of co-related
features belonging to one single item.
While speaking about data, it is important to understand how data differs from similar
terms like information and knowledge. While a set of data can be used to directly derive
information, knowledge or wisdom is often derived in an indirect manner. In our previous
article on learning data mining, we gave examples to illustrate the differences between data,
information and knowledge. Using the same example, consider a store manager of a local
market who sells hundreds of candles every Sunday to his customers. Which customer is
buying candles on which date - those are the data stored in the database of the store. These
data give information such as how many candles are sold from the store per week - information
that may be valuable for inventory management. This information can be further used to
indirectly infer that people who buy candles every Sunday go to church to offer a prayer. Now
that's knowledge - a new learning based on available information.
Another way to look at it is by considering the level of abstraction. Data is objective and
thus has the lowest level of abstraction, whereas information and knowledge are increasingly
subjective and involve higher levels of abstraction.
In terms of a scientific definition, one may conclude that data have higher entropy than
information or knowledge.
Types of Data
One of the fundamental things you must learn before attempting any kind of data
modeling is that how we model the data depends completely on the nature, or type, of the
data. Data can be both qualitative and quantitative, and it's important to understand the
distinction between them.
Qualitative Data
Qualitative data are also called categorical data, as they represent distinct categories rather
than numbers. In dimensional modeling, they are often termed "dimensions". Mathematical
operations such as addition or subtraction do not make any sense on such data.
Examples of qualitative data are eye color, zip code, phone number etc.
Qualitative data can be further classified into the following classes:
NOMINAL :
Nominal data represent values where the order does not carry any meaningful
information. Consider your passport number: there is no information as such in
whether your passport number is greater or less than someone else's. Or consider
the eye color of people: it does not matter in which order we represent the eye colors.
ID, ZIP code, phone number, eye color etc. are examples of the nominal class of
qualitative data.
ORDINAL :
The order of the data is important for ordinal data. Consider the height of people -
tall, medium, short. Although these are qualitative, the order of the attributes does
matter, in the sense that they carry some comparative information. Similarly, letter
grades, a scale of 1-10 etc. are examples of ordinal data.
In the field of dimensional modeling, this kind of data is sometimes referred to as a
non-additive fact.
Quantitative data
Quantitative data are also called numeric data, as they represent numbers. In the dimensional
data modeling approach, these data are termed "measures".
Examples of quantitative data are the height of a person, the amount of goods sold, revenue etc.
Quantitative attributes can be further classified as below.
INTERVAL :
The interval classification is used where there is no true zero point in the data and the
division operation does not make sense. Bank balance, temperature on the Celsius
scale, GRE score etc. are examples of interval-class data. Dividing one GRE score by
another will not make any sense. In dimensional modeling this is synonymous with
semi-additive facts.
RATIO :
The ratio class applies to data that have a true "zero" and where division does make
sense. Consider revenue, length of time etc. These measures are generally additive.
The below table illustrates the different actions that are possible on the various data types:
ACTIONS -->     Distinct   Order   Addition   Multiplication
Nominal            Y
Ordinal            Y         Y
Interval           Y         Y        Y
Ratio              Y         Y        Y            Y
It is essential to understand the above differences in the nature of data in order to suggest an
appropriate model for storing it. Many of our analytical tools (e.g. MS Excel) and data mining
tools (e.g. R) do not automatically understand the nature of the data, so we need to explicitly
model the data for those tools. For example, "R" provides two test functions, "is.numeric()"
and "is.factor()", to determine whether data is numeric or categorical (dimensional)
respectively, and if the default attribution is wrong we can use functions like "as.factor()" or
"as.numeric()" to re-attribute the nature of the data.
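For readers working in Python rather than R, a rough stdlib-only analogue of these checks might look like the sketch below. The column names and the override rule are invented for illustration:

```python
# Classify each column of a row-oriented table as numeric (measure)
# or categorical (dimension) before modeling, roughly mirroring
# R's is.numeric()/is.factor() plus as.factor() re-attribution.
rows = [
    {"eye_color": "brown", "zip_code": 10001, "height_cm": 172.0},
    {"eye_color": "blue",  "zip_code": 94105, "height_cm": 181.5},
]

def column_kind(name):
    # Like R's default attribution, the naive guess can be wrong:
    # zip_code parses as a number but is really nominal, so an
    # explicit override wins, mirroring the as.factor() step.
    overrides = {"zip_code": "categorical"}
    if name in overrides:
        return overrides[name]
    values = [r[name] for r in rows]
    return "numeric" if all(isinstance(v, (int, float)) for v in values) else "categorical"

kinds = {name: column_kind(name) for name in rows[0]}
print(kinds)
```

Without the override, zip_code would be misclassified as numeric even though addition and multiplication make no sense on it, which is exactly the pitfall the table above describes.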
What is a star schema?
This schema is used in data warehouse models where one centralized fact table references a
number of dimension tables, so that the keys (primary keys) from all the dimension tables
flow into the fact table (as foreign keys), where the measures are stored. The
entity-relationship diagram looks like a star, hence the name.
Consider a fact table that stores the sales quantity for each product and customer at a certain
time. Sales quantity will be the measure here, and keys from the customer, product and time
dimension tables will flow into the fact table.
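That sales example can be sketched as a star schema with SQLite; all table and column names below are illustrative, not a prescribed layout:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables: their primary keys flow into the fact table.
cur.execute("CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, day TEXT)")

# Central fact table: foreign keys plus the measure (sales quantity).
cur.execute("""CREATE TABLE fact_sales (
    product_id INTEGER, customer_id INTEGER, date_id INTEGER,
    sales_qty INTEGER)""")

cur.execute("INSERT INTO dim_product  VALUES (1, 'Rice')")
cur.execute("INSERT INTO dim_customer VALUES (1, 'Ramesh')")
cur.execute("INSERT INTO dim_date     VALUES (1, '2019-04-05')")
cur.execute("INSERT INTO fact_sales   VALUES (1, 1, 1, 20)")

# A typical star join: qualify the measure by all three dimensions.
row = cur.execute("""
    SELECT p.name, c.name, d.day, f.sales_qty
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_customer c ON c.customer_id = f.customer_id
    JOIN dim_date     d ON d.date_id     = f.date_id
""").fetchone()
print(row)  # ('Rice', 'Ramesh', '2019-04-05', 20)
```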
If you are not very familiar with star schema design or its use, we strongly recommend you
read our article on this subject - different schemas in dimensional modeling.
Data warehouse:
In 1980, Bill Inmon, known as the father of data warehousing, gave this definition: "A data
warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in
support of management's decision making process."
Subject-oriented : means that the data addresses a specific subject, such as sales, inventory etc.
Integrated : means that the data is obtained from a variety of sources.
Time-variant : implies that the data is stored in such a way that changes to the data over time
are recorded.
Non-volatile : implies that data is never removed, i.e., historical data is also kept.
2. What is the difference between a database and a data warehouse?
A database is a collection of related data.
A data warehouse is also a collection of information, as well as a supporting system.
3. What are the benefits of data warehousing?
Historical information for comparative and competitive analysis.
Enhanced data quality and completeness.
Supplementing disaster recovery plans with another data backup source.
4. What are the types of data warehouse?
There are mainly three types of data warehouse:
Enterprise Data Warehouse
Operational data store
Data Mart
5. What is the difference between data mining and data warehousing?
In data mining, the operational data is analyzed using statistical and clustering techniques to
find hidden patterns and trends. So data mining performs a kind of summarization of the data,
which can be used by data warehouses for faster analytical processing for business intelligence.
A data warehouse may make use of data mining for analytical processing of the data in a
faster way.
6. What are the applications of data warehouse?
Data warehouses are used extensively in banking and
financial services, and consumer goods.
A data warehouse is mainly used for generating reports and
answering predefined queries. A data warehouse is used for strategic purposes, performing
multidimensional analysis.
A data warehouse is also used for knowledge discovery and
strategic decision making using data mining tools.
7. What are the types of data warehouse applications?
Info processing
Analytical processing
Data mining
8. What is metadata?
Metadata is defined as data about data. Metadata describes entities and their attributes.
9. What are the benefits of data warehousing?
The implementation of a data warehouse can provide many benefits to an
organization. A data warehouse can:
Facilitate integration in an environment characterized by un-integrated applications.
Integrate enterprise data across a variety of functions.
Integrate external as well as internal data.
Support strategic and long-term business planning.
Support day-to-day tactical decisions.
Enable insight into business trends and business opportunities.
Organize and store historical data needed for analysis.
Make available historical data, extending over many years, which enables trend
analysis.
Provide more accurate and complete information.
Improve knowledge about the business.
Enable cost-effective decision making.
Enable organizations to understand their customers and their needs, as well as their
competitors.
Enhance customer service and satisfaction.
Provide competitive advantage.
Provide easy access for end-users.
Provide timely access to corporate information.
10. What is the difference between a dimension table and a fact table?
A dimension table consists of tuples of attributes of the dimension. A fact table can be
thought of as having tuples, one per recorded fact. Each fact contains some measured or
observed variables and identifies them with pointers to dimension tables.
ETL testing (Extract, Transform, and Load)
It has been observed that Independent Verification and Validation is gaining huge market
potential, and many companies now see it as a prospective business gain. Customers are
offered a range of products in terms of service offerings, distributed across many areas based
on technology, process and solutions. ETL, or data warehouse testing, is one of the offerings
that is developing rapidly and successfully.
Why do organizations need a data warehouse?
Organizations with organized IT practices are looking forward to creating the next level of
technology transformation. They are now trying to make themselves much more operational
with easy-to-interoperate data. That said, data is the most important part of any organization,
be it everyday data or historical data. Data is the backbone of any report, and reports are the
baseline on which all the vital management decisions are taken.
Most companies are taking a step forward by constructing a data warehouse to store and
monitor real-time as well as historical data. Crafting an efficient data warehouse is not an
easy job. Many organizations have distributed departments with different applications running
on distributed technology. An ETL tool is employed to achieve flawless integration between
the different data sources from different departments. The ETL tool works as an integrator,
extracting data from different sources, transforming it into the preferred format based on the
business transformation rules, and loading it into a cohesive DB known as the data warehouse.
A well-planned, well-defined and effective testing scope guarantees smooth conversion of the
project to production. A business gains real buoyancy once the ETL processes are verified and
validated by an independent group of experts, making sure the data warehouse is concrete
and robust.
ETL or Data warehouse testing is categorized into four different engagements irrespective
of technology or ETL tools used:
New Data Warehouse Testing – A new DW is built and verified from scratch. Data input
is taken from customer requirements and different data sources, and the new data warehouse
is built and verified with the help of ETL tools.
Migration Testing – In this type of project the customer has an existing DW and ETL
performing the job, but is looking to adopt a new tool in order to improve efficiency.
Change Request – In this type of project new data is added to an existing DW from different
sources. There might also be a condition where the customer needs to change an existing
business rule or integrate a new rule.
Report Testing – Reports are the end result of any data warehouse and the basic purpose
for which the DW is built. A report must be tested by validating its layout, the data in the
report, and the calculations.
ETL Testing Techniques:
1) Verify that data is transformed correctly according to the various business requirements
and rules.
2) Make sure that all projected data is loaded into the data warehouse without any data loss
or truncation.
3) Make sure that the ETL application appropriately rejects invalid data, replaces it with
default values, and reports it.
4) Make sure that data is loaded into the data warehouse within the prescribed and expected
time frames, to confirm improved performance and scalability.
Apart from these 4 main ETL testing methods, other testing methods like integration testing
and user acceptance testing are also carried out to make sure everything is smooth and reliable.
ETL Testing Process:
Similar to any other testing that falls under Independent Verification and Validation, ETL
testing goes through the same phases:
Business and requirement understanding
Validating
Test Estimation
Test planning based on the inputs from test estimation and business requirement
Designing test cases and test scenarios from all the available inputs
Once all the test cases are ready and approved, the testing team proceeds to perform
pre-execution checks and prepare test data for testing
Lastly, execution is performed until the exit criteria are met
Upon successful completion, a summary report is prepared and the closure process is done.
It is necessary to define a test strategy, mutually accepted by stakeholders, before starting
actual testing. A well-defined test strategy makes sure that the correct approach is followed,
meeting the testing aspirations. ETL testing might require the testing team to write SQL
statements extensively, or to tailor the SQL provided by the development team. In any case,
the testing team must be aware of the results they are trying to get using those SQL statements.
Difference between Database and Data Warehouse Testing
There is a popular misunderstanding that database testing and data warehouse testing are
similar, while in fact both take different directions in testing.
Database testing is done using a smaller scale of data, normally with OLTP (online
transaction processing) type databases, while data warehouse testing is done with large
volumes of data involving OLAP (online analytical processing) databases.
In database testing, data is normally injected consistently from uniform sources, while in
data warehouse testing most of the data comes from different kinds of data sources which
are often inconsistent.
We generally perform CRUD (create, read, update and delete) operations in database
testing, while in data warehouse testing we use read-only (SELECT) operations.
Normalized databases are used in DB testing, while denormalized DBs are used in data
warehouse testing.
There are a number of universal verifications that have to be carried out for any kind of data
warehouse testing. Below is the list of objects that are treated as essential for validation in
ETL testing:
- Verify that data transformation from source to destination works as expected
- Verify that the expected data is added to the target system
- Verify that all DB fields and field data are loaded without any truncation
- Verify data checksums for record count match
- Verify that proper error logs are generated for rejected data, with all details
- Verify NULL value fields
- Verify that duplicate data is not loaded
- Verify data integrity
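A few of these universal checks can be sketched in plain Python. In a real project they would run as SQL against the actual source and target databases; the sample rows here are invented:

```python
# Minimal sketches of three universal ETL checks: record-count match,
# NULL value fields, and duplicate detection.
source = [(1, "Rice", 20), (2, "Wheat", 15), (3, "Corn", None)]
target = [(1, "Rice", 20), (2, "Wheat", 15), (3, "Corn", None)]

def record_count_matches(src, tgt):
    # Checksum-style count comparison between source and target.
    return len(src) == len(tgt)

def rows_with_nulls(rows):
    # Flag rows containing NULL (None) fields for review.
    return [r for r in rows if any(v is None for v in r)]

def has_duplicates(rows):
    # Duplicate rows collapse when placed in a set.
    return len(rows) != len(set(rows))

print(record_count_matches(source, target))  # True
print(rows_with_nulls(target))               # [(3, 'Corn', None)]
print(has_duplicates(target))                # False
```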
ETL Testing Challenges:
ETL testing is quite different from conventional testing. There are many challenges to face
while performing data warehouse testing. Here is a list of a few ETL testing challenges I
experienced on my project:
- Incompatible and duplicate data.
- Loss of data during the ETL process.
- Unavailability of an inclusive test bed.
- Testers have no privileges to execute ETL jobs on their own.
- The volume and complexity of the data is very high.
- Faults in business processes and procedures.
- Trouble acquiring and building test data.
- Missing business flow information.
Data is important for businesses to make critical business decisions. ETL testing plays a
significant role in validating and ensuring that the business information is exact, consistent
and reliable. It also minimizes the hazard of data loss in production.
In computing, Extract, Transform and Load (ETL) refers to a process in database usage and
especially in data warehousing that involves:
Extracting data from outside sources
Transforming it to fit operational needs, which can include quality levels
Loading it into the end target (a database; more specifically, an operational data store, data
mart or data warehouse)
Extract
The first part of an ETL process involves extracting the data from the source systems. In
many cases this is the most challenging aspect of ETL, since extracting data correctly sets
the stage for the success of the subsequent processes.
ETL Architecture Pattern
Most data warehousing projects consolidate data from different source systems. Each separate
system may also use a different data organization and/or format. Common data source formats
are relational databases and flat files, but they may include non-relational database structures
such as Information Management System (IMS), other data structures such as Virtual Storage
Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even fetching from
outside sources such as web spidering or screen scraping. Streaming the extracted data source
and loading it on the fly into the destination database is another way of performing ETL when
no intermediate data storage is required. In general, the goal of the extraction phase is to
convert the data into a single format appropriate for transformation processing.
An intrinsic part of the extraction involves parsing the extracted data, resulting in a check of
whether the data meets an expected pattern or structure. If not, the data may be rejected
entirely or in part.
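The parse-and-reject step might be sketched like this in Python, assuming a made-up flat-file layout of id,name,qty:

```python
import re

# Each extracted record must match the expected pattern; rows that do
# not conform are rejected in part, as described above.
raw_lines = ["1,Rice,20", "2,Wheat,fifteen", "3,Corn,5"]
pattern = re.compile(r"^(\d+),([^,]+),(\d+)$")

accepted, rejected = [], []
for line in raw_lines:
    m = pattern.match(line)
    if m:
        # Parse into typed fields once the structure is confirmed.
        accepted.append((int(m.group(1)), m.group(2), int(m.group(3))))
    else:
        rejected.append(line)

print(accepted)  # [(1, 'Rice', 20), (3, 'Corn', 5)]
print(rejected)  # ['2,Wheat,fifteen']
```

Rejected rows would normally be written to an error log with full details, per the verification list earlier in this document.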
Transform
The transform stage applies a series of rules or functions to the data extracted from the source
to derive the data for loading into the end target. Some data sources will require very little or
even no manipulation of data. In other cases, one or more of the following transformation
types may be required to meet the business and technical needs of the target database:
Selecting only certain columns to load (or selecting null columns not to load). For example, if the
source data has three columns (also called attributes), for example roll_no, age, and salary, then
the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore
all those records where salary is not present (salary = null).
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the
warehouse stores M for male and F for female)
Encoding free-form values (e.g., mapping "Male" to "M")
Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
Sorting
Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
Aggregation (for example, rollup - summarizing multiple rows of data: total sales for each
store, for each region, etc.)
Generating surrogate-key values
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a
string in one column, into individual values in different columns)
Disaggregation of repeating columns into a separate detail table (e.g., moving a series of
addresses in one record into single addresses in a set of records in a linked address table)
Looking up and validating the relevant data from tables or reference files for slowly changing
dimensions.
Applying any form of simple or complex data validation. If validation fails, it may result in a
full, partial or no rejection of the data, and thus none, some or all of the data is handed over
to the next step, depending on the rule design and exception handling. Many of the above
transformations may themselves result in exceptions, for example when a code translation
parses an unknown code in the extracted data.
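Several of the transformation types above (column selection, code translation, a derived value, and null filtering) can be combined in one small Python sketch over invented source rows:

```python
# Invented source rows; the column names echo the roll_no/age/salary
# example used in the text.
source_rows = [
    {"roll_no": 1, "age": 30, "gender": 1, "qty": 2, "unit_price": 5.0, "salary": 100},
    {"roll_no": 2, "age": 25, "gender": 2, "qty": 3, "unit_price": 4.0, "salary": None},
]
gender_codes = {1: "M", 2: "F"}  # translating coded values

transformed = [
    {
        "roll_no": r["roll_no"],                    # selecting only certain columns
        "gender": gender_codes[r["gender"]],        # 1/2 -> M/F translation
        "sale_amount": r["qty"] * r["unit_price"],  # derived value: qty * unit_price
    }
    for r in source_rows
    if r["salary"] is not None                      # ignore records where salary = null
]
print(transformed)  # [{'roll_no': 1, 'gender': 'M', 'sale_amount': 10.0}]
```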
Load
The load phase loads the data into the end target, usually the data warehouse (DW). Depending
on the requirements of the organization, this process varies widely. Some data warehouses may
overwrite existing information with cumulative information; frequently, updating of extracted
data is done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts
of the same data warehouse) may add new data in a historical form at regular intervals - for
example, hourly. To understand this, consider a data warehouse that is required to maintain
the sales records of the last year. This data warehouse will overwrite any data older than a
year with newer data. However, the entry of data within any one-year window is made in a
historical manner.
The timing and scope of replacing or appending are strategic design choices dependent on the
time available and the business needs. More complex systems can maintain a history and audit
trail of all changes to the data loaded in the data warehouse.
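The one-year rolling window described above might be sketched with SQLite as follows; the table name, column names and cutoff date are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sales_history (sale_date TEXT, amount REAL)")
cur.executemany("INSERT INTO sales_history VALUES (?, ?)",
                [("2018-06-01", 10.0), ("2019-05-01", 20.0)])

# Overwrite (delete) anything older than the one-year cutoff,
# then append the new day's data historically.
cutoff = "2018-08-13"  # roughly one year before the load date
cur.execute("DELETE FROM sales_history WHERE sale_date < ?", (cutoff,))
cur.execute("INSERT INTO sales_history VALUES (?, ?)", ("2019-08-13", 30.0))

rows = cur.execute(
    "SELECT sale_date FROM sales_history ORDER BY sale_date").fetchall()
print(rows)  # [('2019-05-01',), ('2019-08-13',)]
```

ISO-formatted date strings compare correctly as text, which is why the DELETE predicate works here without date parsing.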
As the load phase interacts with a database, the constraints defined in the database schema — as
well as in triggers activated upon data load — apply (for example, uniqueness, referential
integrity, mandatory fields), which also contribute to the overall data quality performance of theETL process.
For example, a financial institution might have information on a customer in several
departments, and each department might have that customer's information listed in a different
way. The membership department might list the customer by name, whereas the accounting
department might list the customer by number. ETL can bundle all this data and consolidate it
into a uniform presentation, such as for storing in a database or data warehouse.
Another way that companies use ETL is to move information to another application
permanently. For instance, the new application might use another database vendor and most
likely a very different database schema. ETL can be used to transform the data into a format
suitable for the new application to use.
An example of this would be an Expense and Cost Recovery System (ECRS) such as used by
accountancies, consultancies and lawyers. The data usually ends up in the time and billing
system, although some businesses may also utilize the raw data for employee productivity
reports to Human Resources (personnel dept.) or equipment usage reports to Facilities
Management.
Real-life ETL cycle
The typical real-life ETL cycle consists of the following execution steps:
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create aggregates or
disaggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules. Also, in case of failure, helps to
diagnose/repair)
8. Publish (to target tables)
9. Archive
10. Clean up
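The cycle above can be caricatured as a chain of functions; each step here is only a stub standing in for real extract, validate, transform, stage and publish logic:

```python
# Tiny pipeline mirroring the numbered cycle; audit, archive and
# clean-up steps are omitted for brevity.
def extract():          return ["1,Rice,20", "2,Wheat,x"]   # from sources
def validate(rows):     return [r for r in rows             # drop malformed rows
                                if r.split(",")[2].isdigit()]
def transform(rows):    return [(int(a), b, int(c)) for a, b, c in
                                (r.split(",") for r in rows)]
def stage(rows):        return list(rows)                   # staging tables, if used
def publish(staged):    return {"target": staged}           # publish to target tables

staged = stage(transform(validate(extract())))
warehouse = publish(staged)
print(warehouse)  # {'target': [(1, 'Rice', 20)]}
```

A real implementation would log and audit each stage (step 7 of the cycle) so that failures can be diagnosed and repaired.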
Challenges
ETL processes can involve considerable complexity, and significant operational problems can
occur with improperly designed ETL systems.
The range of data values or data quality in an operational system may exceed the expectations of
designers at the time validation and transformation rules are specified. Data profiling of a source
during data analysis can identify the data conditions that will need to be managed by the
transformation rule specifications. This leads to an amendment of the validation rules
explicitly and implicitly implemented in the ETL process.
Data warehouses are typically assembled from a variety of data sources with different formats
and purposes. As such, ETL is a key process for bringing all the data together in a standard,
homogeneous environment.
Design analysts should establish the scalability of an ETL system across the lifetime of its
usage. This includes understanding the volumes of data that will have to be processed within
service level agreements. The time available to extract from source systems may change, which
may mean the same amount of data has to be processed in less time. Some ETL systems have
to scale to process terabytes of data in order to update data warehouses holding tens of
terabytes of data. Increasing volumes of data may require designs that can scale from daily
batch to multiple-day micro batch, to integration with message queues or real-time change
data capture for continuous transformation and update.
Performance
ETL vendors benchmark their record systems at multiple TB (terabytes) per hour (or ~1 GB per
second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit-
network connections, and lots of memory. The fastest ETL record is currently held by
Syncsort,[1] Vertica and HP at 5.4 TB in under an hour, which is more than twice as fast as the
earlier record held by Microsoft and Unisys.
In real life, the slowest part of an ETL process usually occurs in the database load phase.
Databases may perform slowly because they have to take care of concurrency, integrity
maintenance, and indices. Thus, for better performance, it may make sense to employ:
a Direct Path Extract method or bulk unload whenever possible (instead of querying the
database), to reduce the load on the source system while getting a high-speed extract
most of the transformation processing outside of the database
bulk load operations whenever possible.
Still, even using bulk operations, database access is usually the bottleneck in the ETL process.
Some common methods used to increase performance are:
Partition tables (and indices). Try to keep partitions similar in size (watch for null values which
can skew the partitioning).
Do all validation in the ETL layer before the load.
Disable integrity checking (disable constraint ...) in the target database tables during the load.
Disable triggers (disable trigger ...) in the target database tables during the load. Simulate
their effect as a separate step.
Generate IDs in the ETL layer (not in the database).
Drop the indices (on a table or partition) before the load and recreate them after the load
(SQL: drop index ...; create index ...).
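The last tip can be sketched with SQLite standing in for the target database; the table and index names are made up for the example, and a production load would use the warehouse's own bulk-load utility rather than plain inserts.

```python
import sqlite3

# Sketch of "drop the indices before the load, recreate them after",
# using an in-memory SQLite database as the stand-in target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX ix_fact_sales_id ON fact_sales (id)")

rows = [(i, i * 1.5) for i in range(1000)]

conn.execute("DROP INDEX ix_fact_sales_id")            # drop before the load
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)  # bulk load
conn.execute("CREATE INDEX ix_fact_sales_id ON fact_sales (id)")  # recreate

count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

Rebuilding the index once after the load is typically cheaper than maintaining it row by row during the load.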
A recent development in ETL software is the implementation of parallel processing, of which
there are three main types:
Data: Splitting a single sequential file into smaller data files to provide parallel access.
Pipeline: Allowing the simultaneous running of several components on the same data stream.
For example: looking up a value on record 1 at the same time as adding two fields on record 2.
Component: The simultaneous running of multiple processes on different data streams in the
same job, for example, sorting one input file while removing duplicates on another file.
All three types of parallelism usually operate combined in a single job.
An additional difficulty comes with making sure that the data being uploaded is relatively
consistent. Because multiple source databases may have different update cycles (some may be
updated every few minutes, while others may take days or weeks), an ETL system may be
required to hold back certain data until all sources are synchronized. Likewise, where a
warehouse may have to be reconciled to the contents in a source system or with the general
ledger, establishing synchronization and reconciliation points becomes necessary.
Rerunnability, recoverability
Data warehousing procedures usually subdivide a big ETL process into smaller pieces running
sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with a
"row_id", and tag each piece of the process with a "run_id". In case of a failure, having these IDs
will help to roll back and rerun the failed piece.
Best practice also calls for "checkpoints", which are states when certain phases of the process are
completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some
temporary files, log the state, and so on.
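A minimal sketch of the row_id/run_id tagging, assuming an in-memory list as the target table; the function names and row format are invented for the example.

```python
# Sketch of rerunnability: every loaded row carries the run that loaded it
# (run_id) and its position in the feed (row_id), so a failed piece can be
# rolled back and rerun cleanly.

def load_piece(rows, run_id, target):
    # Tag each row with the current run_id and a sequential row_id.
    for row_id, row in enumerate(rows, start=1):
        target.append({"run_id": run_id, "row_id": row_id, **row})

def rollback(target, run_id):
    # Undo a failed piece: drop everything tagged with its run_id.
    target[:] = [r for r in target if r["run_id"] != run_id]

target = []
load_piece([{"v": 1}, {"v": 2}], run_id=7, target=target)
rollback(target, run_id=7)                                  # run 7 failed
load_piece([{"v": 1}, {"v": 2}], run_id=8, target=target)   # rerun as run 8
```

After the rollback and rerun, the target holds exactly one clean copy of the data, all tagged with the successful run.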
Virtual ETL
As of 2010 data virtualization had begun to advance ETL processing. The application of data
virtualization to ETL allowed solving the most common ETL tasks of data migration and
application integration for multiple dispersed data sources. So-called Virtual ETL operates with
the abstracted representation of the objects or entities gathered from the variety of relational,
semi-structured and unstructured data sources. ETL tools can leverage object-oriented modeling
and work with entities' representations persistently stored in a centrally located hub-and-spoke
architecture. Such a collection that contains representations of the entities or objects gathered
from the data sources for ETL processing is called a metadata repository and it can reside in
memory[2] or be made persistent. By using a persistent metadata repository, ETL tools can
transition from one-time projects to persistent middleware, performing data harmonization and
data profiling consistently and in near-real time.[citation needed]
Dealing with keys
Keys are some of the most important objects in all relational databases, as they tie everything
together. A primary key is a column which is the identifier for a given entity, whereas a foreign
key is a column in another table which refers to a primary key. These keys can also be made up
of several columns, in which case they are composite keys. In many cases the primary key is
an auto-generated integer which has no meaning for the business entity being represented, but
solely exists for the purpose of the relational database - commonly referred to as a surrogate key.
As there will usually be more than one data source being loaded into the warehouse, the keys are
an important concern to be addressed.
Your customers might be represented in several data sources; in one their SSN (Social
Security Number) might be the primary key, their phone number in another, and a surrogate key in
the third. All of the customer's information needs to be consolidated into one dimension table.
A recommended way to deal with the concern is to add a warehouse surrogate key, which will be
used as a foreign key from the fact table.[3]
Usually updates will occur to a dimension's source data, which obviously must be reflected in the
data warehouse.
If the primary key of the source data is required for reporting, the dimension already contains
that piece of information for each row. If the source data uses a surrogate key, the warehouse
must keep track of it even though it is never used in queries or reports.
That is done by creating a lookup table which contains the warehouse surrogate key and the
originating key.[4] This way the dimension is not polluted with surrogates from various source
systems, while the ability to update is preserved.
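A minimal sketch of such a lookup table, with invented source-system names; the matching step that would decide that two different originating keys refer to the same customer is out of scope here, so each (source system, originating key) pair simply gets its own warehouse surrogate key.

```python
# Sketch of a key lookup table: each (source_system, originating_key) pair
# maps to one warehouse surrogate key, so the dimension itself is never
# polluted with source-system keys.

lookup = {}        # (source_system, originating_key) -> warehouse surrogate key
next_key = [1]     # simple counter standing in for a key sequence

def surrogate_for(source_system, originating_key):
    pair = (source_system, originating_key)
    if pair not in lookup:
        lookup[pair] = next_key[0]   # assign the next warehouse key
        next_key[0] += 1
    return lookup[pair]

k_ssn = surrogate_for("crm", "SSN-123")        # customer keyed by SSN here
k_ssn_again = surrogate_for("crm", "SSN-123")  # same key -> same surrogate
k_phone = surrogate_for("billing", "555-0100") # keyed by phone elsewhere
```

Because the mapping lives in the lookup table, a source system can change its own keys without forcing an update of every fact row that references the warehouse surrogate.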
The lookup table is used in different ways depending on the nature of the source data. There are
five types to consider,[5] of which three selected ones are included here:
Type 1: The dimension row is simply updated to match the current state of the source system. The
warehouse does not capture history. The lookup table is used to identify which dimension row to
update/overwrite.
Type 2: A new dimension row is added with the new state of the source system. A new surrogate key is
assigned. The source key is no longer unique in the lookup table.
Fully logged: A new dimension row is added with the new state of the source system, while the previous
dimension row is updated to reflect that it is no longer active, recording the time of deactivation.
Tools
Programmers can set up ETL processes using almost any programming language, but building
such processes from scratch can become complex. Increasingly, companies are buying ETL tools
to help in the creation of ETL processes.[6]
By using an established ETL framework, one may increase one's chances of ending up with
better connectivity and scalability[citation needed]. A good ETL tool must be able to communicate
with the many different relational databases and read the various file formats used throughout an
organization. ETL tools have started to migrate into Enterprise Application Integration, or even
Enterprise Service Bus, systems that now cover much more than just the extraction,
transformation, and loading of data. Many ETL vendors now have data profiling, data quality,
and metadata capabilities. A common use case for ETL tools is converting CSV files to
formats readable by relational databases. A typical translation of millions of records is facilitated
by ETL tools that enable users to input CSV-like data feeds/files and import them into a database with
as little code as possible.
ETL tools are typically used by a broad range of professionals - from students in computer
science looking to quickly import large data sets to database architects in charge of company
account management - and have become a convenient tool that can be relied on to get
maximum performance. ETL tools in most cases contain a GUI that helps users conveniently
transform data, as opposed to writing large programs to parse files and modify data types, which
ETL tools facilitate as much as possible.
Business intelligence (BI) is a set of theories, methodologies, processes, architectures, and
technologies that transform raw data into meaningful and useful information for business
purposes. BI can handle large amounts of information to help identify and develop new
opportunities. Making use of new opportunities and implementing an effective strategy can
provide a competitive market advantage and long-term stability.[1]
BI technologies provide historical, current and predictive views of business operations. Common
functions of business intelligence technologies are reporting, online analytical processing,
analytics, data mining, process mining, complex event processing, business performance
management, benchmarking, text mining, predictive analytics and prescriptive analytics.
Though the term business intelligence is sometimes a synonym for competitive intelligence
(because they both support decision making), BI uses technologies, processes, and applications
to analyze mostly internal, structured data and business processes, while competitive intelligence
gathers, analyzes and disseminates information with a topical focus on company competitors. If
understood broadly, business intelligence can include the subset of competitive intelligence.
Slowly changing dimension
Dimension is a term in data management and data warehousing that refers to logical groupings of data
such as geographical location, customer or product information. With Slowly Changing
Dimensions (SCDs), data changes slowly rather than changing on a time-based, regular
schedule.[1]
For example, you may have a dimension in your database that tracks the sales records of your
company's salespeople. Creating sales reports seems simple enough, until a salesperson is
transferred from one regional office to another. How do you record such a change in your sales
dimension?
You could calculate the sum or average of each salesperson's sales, but if you use that to
compare the performance of salespeople, that might give misleading information. If the salesperson
was transferred and used to work in a hot market where sales were easy, and now works in a
market where sales are infrequent, his/her totals will look much stronger than those of the other
salespeople in their new region. Or you could create a second salesperson record and treat the
transferred person as a new sales person, but that creates problems.
Dealing with these issues involves SCD management methodologies referred to as Type 0
through 6. Type 6 SCDs are also sometimes called Hybrid SCDs.
Type 0
The Type 0 method is passive: it manages dimensional changes and no action is performed.
Values remain as they were at the time the dimension record was first inserted. In certain
circumstances history is preserved with a Type 0. Higher-order types are employed to guarantee
the preservation of history, whereas Type 0 provides the least or no control.
The most common types are I, II, and III.
Type I
This methodology overwrites old with new data, and therefore does not track historical data.
Example of a supplier table:
Supplier_Key Supplier_Code Supplier_Name Supplier_State
123 ABC Acme Supply Co CA
In the above example, Supplier_Code is the natural key and Supplier_Key is a surrogate key.
Technically, the surrogate key is not necessary, since the row will be unique by the natural key
(Supplier_Code). However, to optimize performance on joins, use integer rather than character
keys.
If the supplier relocates the headquarters to Illinois the record would be overwritten:
Supplier_Key Supplier_Code Supplier_Name Supplier_State
123 ABC Acme Supply Co IL
The disadvantage of the Type I method is that there is no history in the data warehouse. It has the
advantage, however, that it is easy to maintain.
If you have calculated an aggregate table summarizing facts by state, it will need to be
recalculated when the Supplier_State is changed.[1]
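The Type I overwrite can be demonstrated with SQLite standing in for the warehouse; the table and column names follow the supplier example above, and the code is illustrative rather than any specific tool's API.

```python
import sqlite3

# Type 1 in miniature: the dimension row is overwritten in place, so no
# history survives the change.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE supplier (
    supplier_key INTEGER, supplier_code TEXT,
    supplier_name TEXT, supplier_state TEXT)""")
conn.execute("INSERT INTO supplier VALUES (123, 'ABC', 'Acme Supply Co', 'CA')")

# The supplier relocates to Illinois: overwrite the existing row.
conn.execute("UPDATE supplier SET supplier_state = 'IL' "
             "WHERE supplier_code = 'ABC'")

states = [s for (s,) in conn.execute("SELECT supplier_state FROM supplier")]
```

After the update only the new state exists; there is no trace of California, which is exactly the trade-off the text describes.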
Type II
This method tracks historical data by creating multiple records for a given natural key in the
dimensional tables with separate surrogate keys and/or different version numbers. Unlimited
history is preserved for each insert.
For example, if the supplier relocates to Illinois the version numbers will be incremented
sequentially:
Supplier_Key Supplier_Code Supplier_Name Supplier_State Version.
123 ABC Acme Supply Co CA 0
124 ABC Acme Supply Co IL 1
Another method is to add 'effective date' columns.
Supplier_Key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date
123 ABC Acme Supply Co CA 01-Jan-2000 21-Dec-2004
124 ABC Acme Supply Co IL 22-Dec-2004
The null End_Date in row two indicates the current tuple version. In some cases, a standardized
surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be
included in an index, and so that null-value substitution is not required when querying.
Transactions that reference a particular surrogate key (Supplier_Key) are then permanently
bound to the time slices defined by that row of the slowly changing dimension table. An
aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state
the supplier was in at the time of the transaction; no update is needed.
If there are retrospective changes made to the contents of the dimension, or if new attributes areadded to the dimension (for example a Sales_Rep column) which have different effective dates
from those already defined, then this can result in the existing transactions needing to be updated
to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not
a good choice if the dimensional model is subject to change.[1]
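The Type 2 "close the old row, insert a new one" step can be demonstrated with SQLite standing in for the warehouse; the layout follows the effective-date variant of the supplier example, with ISO dates and 9999-12-31 as the surrogate high date.

```python
import sqlite3

# Type 2 in miniature: end-date the current version and insert a new row
# with a fresh surrogate key, preserving unlimited history.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE supplier (
    supplier_key INTEGER, supplier_code TEXT, supplier_state TEXT,
    start_date TEXT, end_date TEXT)""")
conn.execute("INSERT INTO supplier VALUES "
             "(123, 'ABC', 'CA', '2000-01-01', '9999-12-31')")

# The supplier relocates: close the currently open version...
conn.execute("""UPDATE supplier SET end_date = '2004-12-21'
                WHERE supplier_code = 'ABC' AND end_date = '9999-12-31'""")
# ...and insert the new version with a new surrogate key.
conn.execute("INSERT INTO supplier VALUES "
             "(124, 'ABC', 'IL', '2004-12-22', '9999-12-31')")

versions = conn.execute(
    "SELECT supplier_key, supplier_state FROM supplier "
    "ORDER BY supplier_key").fetchall()
```

Both versions now coexist, and any fact row bound to key 123 keeps pointing at the California time slice.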
Type III
This method tracks changes using separate columns and preserves limited history; the history is
restricted to the number of columns designated for storing historical data. The original table
structure from Type I and Type II remains the same, but Type III adds additional
columns. In the following example, an additional column has been added to the table to record
the supplier's original state - only the previous history is stored.
Supplier_Key Supplier_Code Supplier_Name Original_Supplier_State Effective_Date Current_Supplier_State
123 ABC Acme Supply Co CA 22-Dec-2004 IL
This record contains a column for the original state and the current state, but it cannot track the changes
if the supplier relocates a second time.
One variation of this is to create the field Previous_Supplier_State instead of
Original_Supplier_State which would track only the most recent historical change.[1]
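The Previous_Supplier_State variation can be sketched with a plain dictionary standing in for the dimension row; the function name is invented for the example.

```python
# Type 3 in miniature: history lives in an extra column on the same row,
# so only the most recent prior state can ever be kept.
row = {"supplier_key": 123, "supplier_code": "ABC",
       "current_state": "CA", "previous_state": None}

def type3_update(row, new_state):
    # Shift the current value into the history column, then overwrite it.
    row["previous_state"] = row["current_state"]
    row["current_state"] = new_state

type3_update(row, "IL")   # first relocation: CA is preserved
type3_update(row, "NY")   # second relocation: CA is lost, only IL survives
```

The second update silently discards the California history, which is exactly the limitation the text describes.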
Type IV
The Type 4 method is usually referred to as using "history tables", where one table keeps the
current data, and an additional table is used to keep a record of some or all changes. Both
surrogate keys are referenced in the Fact table to enhance query performance.
For the above example the original table name is Supplier and the history table is
Supplier_History.
Supplier
Supplier_key Supplier_Code Supplier_Name Supplier_State
123 ABC Acme Supply Co IL
Supplier_History
Supplier_key Supplier_Code Supplier_Name Supplier_State Create_Date
123 ABC Acme Supply Co CA 22-Dec-2004
This method resembles how database audit tables and change data capture techniques function.
Type 6 / hybrid
The Type 6 method combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible
explanation of the origin of the term is that it was coined by Ralph Kimball during a
conversation with Stephen Pace from Kalido[citation needed]. Ralph Kimball calls this method
"Unpredictable Changes with Single-Version Overlay" in The Data Warehouse Toolkit.[1]
The Supplier table starts out with one record for our example supplier:
Supplier_Key Supplier_Code Supplier_Name Current_State Historical_State Start_Date End_Date Current_Flag
123 ABC Acme Supply Co CA CA 01-Jan-2000 31-Dec-9999 Y
The Current_State and the Historical_State are the same. The Current_Flag attribute indicates
that this is the current or most recent record for this supplier.
When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing:
Supplier_Key Supplier_Code Supplier_Name Current_State Historical_State Start_Date End_Date Current_Flag
123 ABC Acme Supply Co IL CA 01-Jan-2000 21-Dec-2004 N
124 ABC Acme Supply Co IL IL 22-Dec-2004 31-Dec-9999 Y
We overwrite the Current_State information in the first record (Supplier_Key = 123) with the
new information, as in Type 1 processing. We create a new record to track the changes, as in
Type 2 processing. And we store the history in a second state column (Historical_State), which
incorporates Type 3 processing.
For example, if the supplier were to relocate again, we would add another record to the Supplier
dimension, and we would overwrite the contents of the Current_State column:
Supplier_Key Supplier_Code Supplier_Name Current_State Historical_State Start_Date End_Date Current_Flag
123 ABC Acme Supply Co NY CA 01-Jan-2000 21-Dec-2004 N
124 ABC Acme Supply Co NY IL 22-Dec-2004 03-Feb-2008 N
125 ABC Acme Supply Co NY NY 04-Feb-2008 31-Dec-9999 Y
Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State
are always the same.[1]
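The combined update above can be sketched with a list of dictionaries standing in for the dimension table; the function name and field names are invented for the example, and dates use ISO format with 9999-12-31 as the surrogate high date.

```python
# Type 6 in miniature: a Type 2 row insert, a Type 1 overwrite of the
# Current_State on every existing row, and a Type 3 Historical_State column.
HIGH_DATE = "9999-12-31"

rows = [  # state of the supplier dimension before the second relocation
    {"key": 123, "current": "IL", "historical": "CA",
     "start": "2000-01-01", "end": "2004-12-21", "flag": "N"},
    {"key": 124, "current": "IL", "historical": "IL",
     "start": "2004-12-22", "end": HIGH_DATE, "flag": "Y"},
]

def type6_move(rows, new_key, new_state, move_date, day_before):
    for r in rows:
        r["current"] = new_state                  # Type 1: overwrite everywhere
        if r["flag"] == "Y":
            r["end"], r["flag"] = day_before, "N" # close the old current row
    rows.append({"key": new_key, "current": new_state,   # Type 2: new row
                 "historical": new_state,                # Type 3: history column
                 "start": move_date, "end": HIGH_DATE, "flag": "Y"})

type6_move(rows, 125, "NY", "2008-02-04", "2008-02-03")
```

After the move there is exactly one current row, every row shows New York as the current state, and each row still records the state that was in effect during its own time slice.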
Type 2 / Type 6 fact implementation
Type 2 surrogate key with Type 3 attribute
In many Type 2 and Type 6 SCD implementations, the surrogate key from the dimension is put
into the fact table in place of the natural key when the fact data is loaded into the data
repository.[1] The surrogate key is selected for a given fact record based on its effective date and
the Start_Date and End_Date from the dimension table. This allows the fact data to be easily
joined to the correct dimension data for the corresponding effective date.
Here is the Supplier table as we created it above using Type 6 Hybrid methodology:
Supplier_Key Supplier_Code Supplier_Name Current_State Historical_State Start_Date End_Date Current_Flag
123 ABC Acme Supply Co NY CA 01-Jan-2000 21-Dec-2004 N
124 ABC Acme Supply Co NY IL 22-Dec-2004 03-Feb-2008 N
125 ABC Acme Supply Co NY NY 04-Feb-2008 31-Dec-9999 Y
Once the Delivery table contains the correct Supplier_Key, it can easily be joined to the Supplier
table using that key. The following SQL retrieves, for each fact record, the current supplier state
and the state the supplier was located in at the time of the delivery:
SELECT
    delivery.delivery_cost,
    supplier.supplier_name,
    supplier.historical_state,
    supplier.current_state
FROM delivery
INNER JOIN supplier
    ON delivery.supplier_key = supplier.supplier_key
Pure Type 6 implementation
Having a Type 2 surrogate key for each time slice can cause problems if the dimension is subject
to change.[1]
A pure Type 6 implementation does not use this, but uses a surrogate key for each master data
item (e.g. each unique supplier has a single surrogate key).
This avoids any changes in the master data having an impact on the existing transaction data.
It also allows more options when querying the transactions.
Here is the Supplier table using the pure Type 6 methodology:
Supplier_Key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date
456 ABC Acme Supply Co CA 01-Jan-2000 21-Dec-2004
456 ABC Acme Supply Co IL 22-Dec-2004 03-Feb-2008
456 ABC Acme Supply Co NY 04-Feb-2008 31-Dec-9999
The following example shows how the query must be extended to ensure a single supplier record
is retrieved for each transaction.
SELECT
    supplier.supplier_code,
    supplier.supplier_state
FROM supplier
INNER JOIN delivery
    ON supplier.supplier_key = delivery.supplier_key
    AND delivery.delivery_date >= supplier.start_date
    AND delivery.delivery_date <= supplier.end_date
A fact record with an effective date (Delivery_Date) of August 9, 2001 will be linked to
Supplier_Code ABC, with a Supplier_State of 'CA'. A fact record with an effective date of
October 11, 2007 will also be linked to the same Supplier_Code ABC, but with a Supplier_State
of 'IL'.
Whilst more complex, there are a number of advantages of this approach, including:
1. If there is more than one date on the fact (e.g. Order Date, Delivery Date, Invoice Payment Date)
you can choose which date to use for a query.
2. You can do "as at now", "as at transaction time" or "as at a point in time" queries by changing
the date filter logic.
3. You don't need to reprocess the Fact table if there is a change in the dimension table (e.g.
adding additional fields retrospectively which change the time slices, or if you make a mistake in
the dates on the dimension table you can correct them easily).
4. You can introduce bi-temporal dates in the dimension table.
5. You can join the fact to the multiple versions of the dimension table to allow reporting of the
same information with different effective dates, in the same query.
The following example shows how a specific date such as '2012-01-01 00:00:00' (which could be
the current datetime) can be used.
SELECT
    supplier.supplier_code,
    supplier.supplier_state
FROM supplier
INNER JOIN delivery
    ON supplier.supplier_key = delivery.supplier_key
    AND '2012-01-01 00:00:00' >= supplier.start_date
    AND '2012-01-01 00:00:00' <= supplier.end_date
Both surrogate and natural key
An alternative implementation is to place both the surrogate key and the natural key into the fact
table.[2] This allows the user to select the appropriate dimension records based on:
the primary effective date on the fact record (above),
the most recent or current information,
any other date associated with the fact record.
This method allows more flexible links to the dimension, even if you have used the Type 2
approach instead of Type 6.
Here is the Supplier table as we might have created it using Type 2 methodology:
Supplier_Key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date Current_Flag
123 ABC Acme Supply Co CA 01-Jan-2000 21-Dec-2004 N
124 ABC Acme Supply Co IL 22-Dec-2004 03-Feb-2008 N
125 ABC Acme Supply Co NY 04-Feb-2008 31-Dec-9999 Y
The following SQL retrieves the most current Supplier_Name and Supplier_State for each fact
record:
SELECT
    delivery.delivery_cost,
    supplier.supplier_name,
    supplier.supplier_state
FROM delivery
INNER JOIN supplier
    ON delivery.supplier_code = supplier.supplier_code
WHERE supplier.current_flag = 'Y'
If there are multiple dates on the fact record, the fact can be joined to the dimension using
another date instead of the primary effective date. For instance, the Delivery table might have a
primary effective date of Delivery_Date, but might also have an Order_Date associated with
each record.
The following SQL retrieves the correct Supplier_Name and Supplier_State for each fact record
based on the Order_Date:
SELECT
    delivery.delivery_cost,
    supplier.supplier_name,
    supplier.supplier_state
FROM delivery
INNER JOIN supplier
    ON delivery.supplier_code = supplier.supplier_code
    AND delivery.order_date >= supplier.start_date
    AND delivery.order_date <= supplier.end_date
Some cautions:
If the join query is not written correctly, it may return duplicate rows and/or give incorrect
answers.
The date comparison might not perform well.
Some Business Intelligence tools do not handle generating complex joins well.
The ETL processes needed to create the dimension table need to be carefully designed to
ensure that there are no overlaps in the time periods for each distinct item of reference data.
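The last caution lends itself to an automated check. A minimal sketch, assuming the time slices for one item of reference data are given as inclusive (start, end) date pairs; the function name and sample dates are invented for the example.

```python
from datetime import date

# Sketch of a validation check: verify that the time slices for a single
# item of reference data never overlap.

def has_overlap(periods):
    ordered = sorted(periods)                    # sort slices by start date
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        if s2 <= e1:                             # next slice starts too early
            return True
    return False

clean = [(date(2000, 1, 1), date(2004, 12, 21)),
         (date(2004, 12, 22), date(2008, 2, 3))]
bad = [(date(2000, 1, 1), date(2004, 12, 22)),   # ends on the 22nd...
       (date(2004, 12, 22), date(2008, 2, 3))]   # ...which also starts here
```

Running such a check in the ETL layer before publishing the dimension catches exactly the duplicate-row problem the cautions above warn about.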
Combining types
Different SCD Types can be applied to different columns of a table. For example, we can apply
Type 1 to the Supplier_Name column and Type 2 to the Supplier_State column of the same
table, the Supplier table.
Data warehousing is the repository of integrated information; data is extracted from
heterogeneous sources. The data warehousing architecture contains different sources, such as
Oracle, flat files and ERP systems; then the staging area and the data warehouse; after that the
different data marts and the reports; and it also has an ODS (Operational Data Store).
This complete architecture is called the data warehousing architecture.
Benefits of data warehousing:
=> Data warehouses are designed to perform well with aggregate queries running on large
amounts of data.
=> The structure of data warehouses is easier for end users to navigate, understand and query
against, unlike relational databases primarily designed to handle lots of transactions.
=> Data warehouses enable queries that cut across different segments of a company's operation.
E.g. production data could be compared against inventory data even if they were originally
stored in different databases with different structures.
=> Queries that would be complex in highly normalized databases could be easier to build and
maintain in data warehouses, decreasing the workload on transaction systems.
=> Data warehousing is an efficient way to manage and report on data that comes from a variety of
sources, is non-uniform, and is scattered throughout a company.
=> Data warehousing is an efficient way to manage demand for lots of information from lots of
users.
=> Data warehousing provides the capability to analyze large amounts of historical data for
nuggets of wisdom that can provide an organization with competitive advantage.
Data modeling is the process of designing a database model. In this data model, data is
stored in two types of tables: fact tables and dimension tables.
The fact table contains the transaction data and the dimension table contains the master data.
Data mining is the process of finding hidden trends in the data.
A multi-dimensional structure is called a data cube. It is a data abstraction that allows one to view
aggregated data from a number of perspectives. Conceptually, the cube consists of a core or base
cuboid, surrounded by a collection of sub-cubes/cuboids that represent the aggregation of the
base cuboid along one or more dimensions. We refer to the dimension to be aggregated as the
measure attribute, while the remaining dimensions are known as the feature attributes.
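A tiny cube can be built by aggregating the measure over every subset of the feature attributes; the data, column names, and cell layout here are invented for illustration.

```python
from itertools import product

# Sketch of a data cube: sum the "profit" measure over every subset of the
# feature attributes (region, product), from the base cuboid down to the
# fully aggregated apex cell.
rows = [
    {"region": "East", "product": "A", "profit": 10},
    {"region": "East", "product": "B", "profit": 5},
    {"region": "West", "product": "A", "profit": 7},
]
dims = ["region", "product"]

cube = {}
for keep in product([True, False], repeat=len(dims)):
    group = tuple(d for d, k in zip(dims, keep) if k)   # dims kept in this cuboid
    for r in rows:
        cell = (group, tuple(r[d] for d in group))
        cube[cell] = cube.get(cell, 0) + r["profit"]
```

The cell keyed by the empty group is the apex cuboid (total profit), while cells keyed by all dimensions form the base cuboid.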
OLAP stands for Online Analytical Processing.
It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis
and querying of large amounts of data. E.g. OLAP technology could provide management with
fast answers to complex queries on their operational data or enable them to analyze their
company's historical data for trends and patterns.
OLTP stands for Online Transaction Processing.
OLTP uses normalized tables to quickly record large amounts of transactions while making sure
that these updates of data occur in as few places as possible. Consequently, OLTP databases are
designed for recording the daily operations and transactions of a business. E.g. a timecard system
that supports a large production environment must successfully record a large number of updates
during critical periods like lunch hour, breaks, startup and close of work.
Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a
fact table can be viewed by a Time dimension (profit by month, quarter, year), a Region dimension
(profit by country, state, city), or a Product dimension (profit for product1, product2).
MOLAP Cubes: stands for Multidimensional OLAP. In MOLAP cubes the data aggregations
and a copy of the fact data are stored in a multidimensional structure on the Analysis Server
computer. It is best when extra storage space is available on the Analysis Server computer and
the best query performance is desired. MOLAP local cubes contain all the necessary data for
calculating aggregates and can be used offline. MOLAP cubes provide the fastest query response
time and performance but require additional storage space for the extra copy of data from the fact
table.
ROLAP Cubes: stands for Relational OLAP. In ROLAP cubes a copy of data from the fact
table is not made and the data aggregates are stored in tables in the source relational database. A
ROLAP cube is best when there is limited space on the Analysis Server and query performance
is not very important. ROLAP local cubes contain the dimensions and cube definitions but
aggregates are calculated when they are needed. ROLAP cubes require less storage space than
MOLAP and HOLAP cubes.
HOLAP Cubes: stands for Hybrid OLAP. A HOLAP cube has a combination of ROLAP
and MOLAP cube characteristics. It does not create a copy of the source data; however, data
aggregations are stored in a multidimensional structure on the Analysis Server computer.
HOLAP cubes are best when storage space is limited but faster query responses are needed.
You can disconnect the report from the catalog to which it is attached by saving the report with a
snapshot of the data.
An active data warehouse provides information that enables decision-makers within an
organization to manage customer relationships nimbly, efficiently and proactively.
Star schema – A single fact table with N dimensions; all dimensions are linked
directly to the fact table. This schema is de-normalized and results in simple joins and less
complex queries, as well as faster results.
Snowflake schema – A dimension with extended dimensions is known as a snowflake schema;
dimensions may be interlinked or may have one-to-many relationships with other tables. This
schema is normalized and results in complex joins and very complex queries, as well as slower
results.
A concept hierarchy that is a total (or) partial order among attributes in a database schema is
called a schema hierarchy.
The roll-up operation, also called the drill-up operation, performs aggregation on a data cube
either by climbing up a concept hierarchy for a dimension (or) by dimension reduction.
Indexing is a technique used for efficient data retrieval, i.e., accessing data in a faster manner. When a table grows in volume, the indexes also increase in size, requiring more storage.
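A minimal illustration using SQLite (chosen here only for demonstration; any relational engine behaves similarly): after creating an index, the query planner switches from a full table scan to an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, f"cust{i % 100}") for i in range(1000)])

# The index speeds retrieval, but its storage grows along with the table.
cur.execute("CREATE INDEX idx_customer ON orders(customer)")

# The planner now reports an index search instead of a full scan.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust7'"
).fetchall()
print(plan[0][-1])  # e.g. SEARCH orders USING INDEX idx_customer (customer=?)
```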
Dimensional Modeling is a design concept used by many data warehouse designers to build their data warehouse. In this design model, all the data is stored in two types of tables – fact tables and dimension tables. A fact table contains the facts/measurements of the business, and a dimension table contains the context of those measurements, i.e., the dimensions on which the facts are calculated. Dimensional modeling is a method for designing a data warehouse.
There are three types of modeling:
1. Conceptual modeling
2. Logical modeling
3. Physical modeling
Data Transformation Services (DTS) is a set of tools available in SQL Server that helps to extract, transform and consolidate data. This data can come from different sources and go to single or multiple destinations, depending on DTS connectivity. To perform such operations, DTS offers a set of tools. Depending on the business needs, a DTS package is created. This package contains a list of tasks that define the work to be performed and the transformations to be done on the data objects.
Import or export data: DTS can import data from a text file or an OLE DB data source into a SQL Server, or vice versa.
Transform data: the DTS designer interface also allows you to select data from a data source connection, map the columns of data to a set of transformations, and send the transformed data to a destination connection. For parameterized queries and mapping purposes, the Data Driven Query task can be used from the DTS designer.
Consolidate data: the DTS designer can also be used to transfer indexes, views, logins, triggers and user-defined data. Scripts can also be generated for the same. For performing these tasks, valid connections to the source and destination data, and to any additional data sources such as lookup tables, must be established.
Data Mining Extensions (DMX) is based on the syntax of SQL. It is based on relational concepts and is mainly used to create and manage data mining models. DMX comprises two types of statements: data definition and data manipulation. Data definition statements are used to define or create new models and structures.
Example: CREATE MINING STRUCTURE
CREATE MINING MODEL
Data manipulation statements are used to manage the existing models and structures.
Example: INSERT INTO
SELECT FROM .CONTENT (DMX)
SQL Server data mining offers Data Mining Add-ins for Office 2007 that allow discovering the patterns and relationships in the data. This also helps in enhanced analysis. The add-in called Data Mining Client for Excel is used to first prepare data, then build, evaluate, manage and predict results.
Data mining is used to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data in data mining helps in reporting, planning strategies, finding meaningful patterns, etc. It is more commonly used to transform large amounts of data into a meaningful form. Data here can be facts, numbers or any real-time information like sales figures, cost, metadata, etc. Information would be the patterns and the relationships amongst the data.
The sequence clustering algorithm collects similar or related paths – sequences of data containing events. The data represents a series of events or transitions between states in a dataset, like a series of web clicks. The algorithm examines all probabilities of transitions and measures the differences, or distances, between all the possible sequences in the data set. This helps it determine which sequences are the best input for clustering.
E.g., the sequence clustering algorithm may help find the path to store products of a "similar" nature in a retail warehouse.
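The first step described above, measuring transition probabilities between states, can be sketched in plain Python (hypothetical click data; the real algorithm goes on to compare and cluster whole sequences):

```python
from collections import Counter

# Toy click sequences: each list is one user session of page states.
sequences = [["home", "search", "product", "cart"],
             ["home", "product", "cart"],
             ["home", "search", "product"]]

transitions = Counter()  # count of each (state, next_state) pair
totals = Counter()       # count of outgoing transitions per state
for seq in sequences:
    for a, b in zip(seq, seq[1:]):
        transitions[(a, b)] += 1
        totals[a] += 1

# Estimated probability of each transition, given the starting state.
probs = {pair: n / totals[pair[0]] for pair, n in transitions.items()}
print(probs[("home", "search")])  # 2 of 3 sessions leave 'home' via 'search'
```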
The association algorithm is used for recommendation engines that are based on market basket analysis. This engine suggests products to customers based on what they bought earlier. The model is built on a dataset containing identifiers, both for individual cases and for the items that the cases contain. A group of items in a case is called an item set. The algorithm traverses the data set to find items that frequently appear together in a case. The MINIMUM_SUPPORT parameter sets how often an item set must appear before it is included in the model.
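The role of minimum support can be illustrated with a hand-rolled support count (toy baskets; this is a sketch of the idea, not the DMX algorithm itself):

```python
from itertools import combinations
from collections import Counter

# Toy market baskets: each set is the items in one case.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
MINIMUM_SUPPORT = 0.5  # an item set must appear in at least half the baskets

# Count how many baskets contain each pair of items.
support = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        support[pair] += 1

# Keep only the pairs that meet the minimum-support threshold.
frequent = {pair for pair, n in support.items()
            if n / len(baskets) >= MINIMUM_SUPPORT}
print(sorted(frequent))  # [('bread', 'butter'), ('bread', 'milk')]
```

Here ('butter', 'milk') appears in only one of four baskets, so the threshold filters it out; raising MINIMUM_SUPPORT prunes more item sets.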
The time series algorithm can be used to predict continuous values of data. Once the algorithm is trained to predict a series of data, it can predict the outcome of other series. The algorithm generates a model that can predict trends based only on the original dataset. New data can also be added and automatically becomes a part of the trend analysis.
E.g., the performance of one employee can be used to forecast the profit.
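As a stand-in for the trained model (the actual algorithm is far more sophisticated), a least-squares line fitted to a toy series shows the idea of extending a trend to the next point:

```python
# Toy series with a clear upward trend.
series = [10.0, 12.0, 14.0, 16.0]

# Ordinary least-squares fit of y = intercept + slope * x, x = 0, 1, 2, ...
n = len(series)
mean_x = (n - 1) / 2
mean_y = sum(series) / n
slope = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(series)) \
        / sum((i - mean_x) ** 2 for i in range(n))
intercept = mean_y - slope * mean_x

# Predict the next value in the series by extending the fitted trend.
forecast = intercept + slope * n
print(forecast)  # 18.0
```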
The Naïve Bayes algorithm is used to generate mining models. These models help to identify relationships between the input columns and the predictable columns. The algorithm can be used in the initial stage of exploration. It calculates the probability of every state of each input column, given each possible state of the predictable columns. After the model is made, the results can be used for exploration and making predictions.
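The probability calculation can be sketched by plain counting (toy data; the column values and class labels are invented for illustration):

```python
from collections import Counter, defaultdict

# Toy rows: (input-column state, predictable-column state).
rows = [("sunny", "yes"), ("sunny", "yes"), ("rainy", "no"),
        ("rainy", "yes"), ("sunny", "no")]

# Count class frequencies and, per class, input-state frequencies.
class_counts = Counter(label for _, label in rows)
cond = defaultdict(Counter)
for value, label in rows:
    cond[label][value] += 1

def score(value):
    """Unnormalized P(class) * P(value | class) for each class."""
    return {label: (class_counts[label] / len(rows))
                   * (cond[label][value] / class_counts[label])
            for label in class_counts}

s = score("sunny")
print(max(s, key=s.get))  # 'yes' is the more probable class for 'sunny'
```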
A decision tree is a tree in which every node is either a leaf node or a decision node. The tree takes an object as input and outputs some decision. All paths from the root node to a leaf node are reached by using AND, OR, or both. The tree is constructed using the regularities of the data. The decision tree is not affected by Automatic Data Preparation.
Models in data mining help the different algorithms in decision making or pattern matching. The second stage of data mining involves considering various models and choosing the best one based on their predictive performance.
Data mining helps analysts in making faster business decisions which increases revenue with lower
costs.
• Data mining helps to understand, explore and identify patterns of data.
• Data mining automates process of finding predictive information in large databases.
• Helps to identify previously hidden patterns.
The process of cleaning junk data is termed data purging. Purging data means getting rid of unnecessary NULL values in columns. This usually happens when the size of the database gets too large.
Data warehousing is merely extracting data from different sources, cleaning the data and storing it in the warehouse, whereas data mining aims to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data in data mining helps in reporting, planning strategies, finding meaningful patterns, etc.
E.g., a data warehouse of a company stores all the relevant information of projects and employees. Using data mining, one can use this data to generate different reports like profits generated, etc.
History
In a 1958 article, IBM researcher Hans Peter Luhn used the term business intelligence. He defined intelligence as: "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal."[3]
Business intelligence as it is understood today is said to have evolved from the decision support
systems that began in the 1960s and developed throughout the mid-1980s. DSS originated in the
computer-aided models created to assist with decision making and planning. From DSS, data
warehouses, Executive Information Systems, OLAP and business intelligence came into focus beginning in the late 80s.
In 1989, Howard Dresner (later a Gartner Group analyst) proposed "business intelligence" as an umbrella term to describe "concepts and methods to improve business decision making by using fact-based support systems."[4]
It was not until the late 1990s that this usage was widespread.[5]
Business intelligence and data warehousing
Often BI applications use data gathered from a data warehouse or a data mart. A data warehouse is a copy of transactional data that facilitates decision support. However, not all data warehouses are used for business intelligence, nor do all business intelligence applications require a data warehouse.
To distinguish between the concepts of business intelligence and data warehouses, Forrester
Research often defines business intelligence in one of two ways:
Using a broad definition: "Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making."[6]
When using this definition, business intelligence also includes technologies such as data integration, data quality, data warehousing, master data management, text and content analytics, and many others that the market sometimes lumps into the Information Management segment. Therefore, Forrester refers to data preparation and data usage as two separate, but closely linked, segments of the business intelligence architectural stack.
Forrester defines the latter, narrower business intelligence market as, "...referring to just the top
layers of the BI architectural stack such as reporting, analytics and dashboards."[7]
Business intelligence and business analytics
Thomas Davenport argues that business intelligence should be divided into querying, reporting, OLAP, an "alerts" tool, and business analytics. In this definition, business analytics is the subset of BI based on statistics, prediction, and optimization.[8]
Applications in an enterprise
Business intelligence can be applied to the following business purposes, in order to drive business value.[citation needed]
1. Measurement – program that creates a hierarchy of performance metrics (see also Metrics Reference Model) and benchmarking that informs business leaders about progress towards business goals (business process management).
2. Analytics – program that builds quantitative processes for a business to arrive at optimal decisions and to perform business knowledge discovery. Frequently involves: data mining, process mining, statistical analysis, predictive analytics, predictive modeling, business process modeling, complex event processing and prescriptive analytics.
3. Reporting/enterprise reporting – program that builds infrastructure for strategic reporting to serve the strategic management of a business, not operational reporting. Frequently involves data visualization, executive information systems and OLAP.
4. Collaboration/collaboration platform – program that gets different areas (both inside and outside the business) to work together through data sharing and electronic data interchange.
5. Knowledge management – program to make the company data-driven through strategies and practices to identify, create, represent, distribute, and enable adoption of insights and experiences that are true business knowledge. Knowledge management leads to learning management and regulatory compliance.
In addition to the above, business intelligence can also provide a proactive approach, such as an alarm function that alerts the end user immediately. There are many types of alerts; for example, if some business value exceeds a threshold, the color of that amount in the report turns red and the business analyst is alerted. Sometimes an alert mail is sent to the user as well. This end-to-end process requires data governance, which should be handled by an expert.[citation needed]
Prioritization of business intelligence projects
It is often difficult to provide a positive business case for business intelligence initiatives, and often the projects must be prioritized through strategic initiatives. Here are some hints to increase the benefits for a BI project.
• As described by Kimball,[9] you must determine the tangible benefits, such as the eliminated cost of producing legacy reports.
• Enforce access to data for the entire organization.[10] In this way even a small benefit, such as a few minutes saved, makes a difference when multiplied by the number of employees in the entire organization.
• As described by Ross, Weill & Robertson for Enterprise Architecture,[11] consider letting the BI project be driven by other business initiatives with excellent business cases. To support this approach, the organization must have enterprise architects who can identify suitable business projects.
• Use a structured and quantitative methodology to create defensible prioritization in line with the actual needs of the organization, such as a weighted decision matrix.[12]
Success factors of implementation
Before implementing a BI solution, it is worth taking different factors into consideration. According to Kimball et al., these are the three critical areas that you need to assess within your organization before getting ready to do a BI project:[13]
1. The level of commitment and sponsorship of the project from senior management
2. The level of business need for creating a BI implementation
3. The amount and quality of business data available.
Business sponsorship
The commitment and sponsorship of senior management is, according to Kimball et al., the most important criterion for assessment.[14] This is because having strong management backing helps overcome shortcomings elsewhere in the project. However, as Kimball et al. state: "even the most elegantly designed DW/BI system cannot overcome a lack of business [management] sponsorship".[15]
It is important that personnel who participate in the project have a vision and an idea of the benefits and drawbacks of implementing a BI system. The best business sponsor should have organizational clout and should be well connected within the organization. It is ideal that the business sponsor is demanding but also able to be realistic and supportive if the implementation runs into delays or drawbacks. The management sponsor also needs to be able to assume accountability and to take responsibility for failures and setbacks on the project. Support from multiple members of management ensures the project does not fail if one person leaves the steering group. However, having many managers work together on the project can also mean that there are several different interests that attempt to pull the project in different directions, such as if different departments want to put more emphasis on their usage. This issue can be countered by an early and specific analysis of the business areas that benefit the most from the implementation. All stakeholders in the project should participate in this analysis in order for them to feel ownership of the project and to find common ground.
Another management problem that should be addressed before the start of implementation is an overly aggressive business sponsor: a manager who gets carried away by the possibilities of using BI and starts wanting the DW or BI implementation to include several different sets of data that were not included in the original planning phase. Since implementations of extra data may add many months to the original plan, it is wise to make sure the person from management is aware of the consequences of such requests.
Business needs
Because of the close relationship with senior management, another critical thing that must be
assessed before the project begins is whether or not there is a business need and whether there is
a clear business benefit by doing the implementation.[16] The needs and benefits of the implementation are sometimes driven by competition and the need to gain an advantage in the market. Another reason for a business-driven approach to implementation of BI is the acquisition of other organizations that enlarge the original organization; it can sometimes be beneficial to implement DW or BI in order to create more oversight.
Companies that implement BI are often large, multinational organizations with diverse subsidiaries.[17] A well-designed BI solution provides a consolidated view of key business data not available anywhere else in the organization, giving management visibility and control over measures that otherwise would not exist.
Amount and quality of available data
Without good data, it does not matter how good the management sponsorship or business-driven
motivation is. Without proper data, or with too little quality data, any BI implementation fails.
Before implementation it is a good idea to do data profiling. This analysis identifies the "content, consistency and structure [..]"[16] of the data. This should be done as early as possible in the process, and if the analysis shows that data is lacking, put the project on the shelf temporarily while the IT department figures out how to properly collect data.
When planning for business data and business intelligence requirements, it is always advisable to consider specific scenarios that apply to a particular organization, and then select the business intelligence features best suited for the scenario.
Often, scenarios revolve around distinct business processes, each built on one or more data sources. These sources are used by features that present that data as information to knowledge workers, who subsequently act on that information. The business needs of the organization for each business process adopted correspond to the essential steps of business intelligence. These essential steps of business intelligence include but are not limited to:
1. Go through business data sources in order to collect needed data
2. Convert business data to information and present it appropriately
3. Query and analyze the data
4. Act on the collected data
The quality aspect in business intelligence should cover the whole process, from the source data to the final reporting. At each step, the quality gates are different:
1. Source data:
o Data standardization: make data comparable (same unit, same pattern, etc.)
o Master data management: unique referential
2. Operational Data Store (ODS):
o Data cleansing: detect and correct inaccurate data
o Data profiling: check for inappropriate and null/empty values
3. Data warehouse:
o Completeness: check that all expected data are loaded
o Referential integrity: unique and existing referential over all sources
o Consistency between sources: check consolidated data against sources
4. Reporting:
o Uniqueness of indicators: only one shared dictionary of indicators
o Formula accuracy: local reporting formulas should be avoided or checked
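A few of these gates can be expressed as simple assertions; the rows and checks below are hypothetical stand-ins for what real profiling and ETL-testing tools automate:

```python
# Toy source and loaded row sets (invented for illustration).
source_rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5},
               {"id": 3, "amount": None}]
loaded_rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5},
               {"id": 3, "amount": None}]

# Data profiling gate: flag null/empty measure values before loading.
nulls = [r["id"] for r in source_rows if r["amount"] is None]
print(nulls)  # [3]

# Completeness gate: every expected source row made it into the warehouse.
assert len(loaded_rows) == len(source_rows)

# Referential-integrity gate: the key is unique across the loaded rows.
ids = [r["id"] for r in loaded_rows]
assert len(ids) == len(set(ids))
print("quality gates passed")
```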
User aspect
Some considerations must be made in order to successfully integrate the usage of business intelligence systems in a company. Ultimately the BI system must be accepted and utilized by the users in order for it to add value to the organization.[18][19] If the usability of the system is poor, the users may become frustrated and spend a considerable amount of time figuring out how to use the system, or may not be able to really use the system. If the system does not add value to the users' mission, they simply don't use it.[19]
To increase user acceptance of a BI system, it can be advisable to consult business users at an early stage of the DW/BI lifecycle, for example at the requirements gathering phase.[18] This can provide an insight into the business process and what the users need from the BI system. There are several methods for gathering this information, such as questionnaires and interview sessions. When gathering the requirements from the business users, the local IT department should also be consulted in order to determine to which degree it is possible to fulfill the business's needs based on the available data.[18]
Taking on a user-centered approach throughout the design and development stage may further
increase the chance of rapid user adoption of the BI system.[19]
Besides focusing on the user experience offered by the BI applications, it may also be possible to motivate the users to utilize the system by adding an element of competition. Kimball[18] suggests implementing a function on the business intelligence portal website where reports on system usage can be found. By doing so, managers can see how well their departments are doing and compare themselves to others, and this may spur them to encourage their staff to utilize the BI system even more.
In a 2007 article, H. J. Watson gives an example of how the competitive element can act as an incentive.[20] Watson describes how a large call centre implemented performance dashboards for all call agents, with monthly incentive bonuses tied to performance metrics. Also, agents could compare their performance to other team members. The implementation of this type of performance measurement and competition significantly improved agent performance.
BI's chances of success can be improved by involving senior management to help make BI a part of the organizational culture, and by providing the users with the necessary tools, training, and support.[20] Training encourages more people to use the BI application.[18] Providing user support is necessary to maintain the BI system and resolve user problems.[19] User support can be incorporated in many ways, for example by creating a website. The website
should contain great content and tools for finding the necessary information. Furthermore, helpdesk support can be used. The help desk can be manned by power users or the DW/BI project team.[18]
BI Portals
A Business Intelligence portal (BI portal) is the primary access interface for Data Warehouse (DW) and Business Intelligence (BI) applications. The BI portal is the user's first impression of the DW/BI system. It is typically a browser application, from which the user has access to all the individual services of the DW/BI system, reports and other analytical functionality. The BI portal must be implemented in such a way that it is easy for the users of the DW/BI application to call on the functionality of the application.[21]
The BI portal's main functionality is to provide a navigation system of the DW/BI application.
This means that the portal has to be implemented in a way that the user has access to all the
functions of the DW/BI application.
The most common way to design the portal is to custom fit it to the business processes of the organization for which the DW/BI application is designed; in that way the portal can best fit the needs and requirements of its users.[22]
The BI portal needs to be easy to use and understand, and if possible have a look and feel similar to other applications or web content of the organization the DW/BI application is designed for (consistency).
The following is a list of desirable features for web portals in general and BI portals in particular:
Usable
Users should easily find what they need in the BI tool.
Content rich
The portal is not just a report printing tool; it should contain more functionality such as advice, help, support information and documentation.
Clean
The portal should be designed so it is easily understandable and not so complex as to confuse the users.
Current
The portal should be updated regularly.
Interactive
The portal should be implemented in a way that makes it easy for the user to use its functionality and encourages them to use the portal. Scalability and customization give the user the means to fit the portal to each user.
Value oriented
It is important that the user has the feeling that the DW/BI application is a valuable resource that is worth working on.
Marketplace
There are a number of business intelligence vendors, often categorized into the remaining independent "pure-play" vendors and consolidated "megavendors" that have entered the market through a recent trend[23] of acquisitions in the BI industry.[24]
Some companies adopting BI software decide to pick and choose from different product offerings (best-of-breed) rather than purchase one comprehensive integrated solution (full-service).[25]
Industry-specific
Specific considerations have to be made for business intelligence systems in some sectors, such as governmental banking regulations. The information collected by banking institutions and analyzed with BI software must be protected from some groups or individuals, while being fully available to other groups or individuals. Therefore, BI solutions must be sensitive to those needs and be flexible enough to adapt to new regulations and changes to existing law.
Semi-structured or unstructured data
Businesses create a huge amount of valuable information in the form of e-mails, memos, notes from call centers, news, user groups, chats, reports, web pages, presentations, image files, video files, and marketing material. According to Merrill Lynch, more than 85% of all business information exists in these forms. These information types are called either semi-structured or unstructured data. However, organizations often only use these documents once.[26]
The management of semi-structured data is recognized as a major unsolved problem in the information technology industry.[27] According to projections from Gartner (2003), white collar workers spend anywhere from 30 to 40 percent of their time searching, finding and assessing unstructured data. BI uses both structured and unstructured data, but the former is easy to search, while the latter contains a large quantity of the information needed for analysis and decision making.[27][28] Because of the difficulty of properly searching, finding and assessing unstructured or semi-structured data, organizations may not draw upon these vast reservoirs of information, which could influence a particular decision, task or project. This can ultimately lead to poorly informed decision making.[26]
Therefore, when designing a business intelligence/DW-solution, the specific problems associated
with semi-structured and unstructured data must be accommodated for as well as those for thestructured data.[28]
Unstructured data vs. semi-structured data
Unstructured and semi-structured data have different meanings depending on their context. In thecontext of relational database systems, unstructured data cannot be stored in predictably ordered
columns and rows. One type of unstructured data is typically stored in a BLOB (binary large
object), a catch-all data type available in most relational database management systems.
Unstructured data may also refer to irregularly or randomly repeated column patterns that vary
from row to row within each file or document.
Many of these data types, however, like e-mails, word processing text files, PPTs, image files, and video files, conform to a standard that offers the possibility of metadata. Metadata can include information such as author and time of creation, and this can be stored in a relational database. Therefore, it may be more accurate to talk about this as semi-structured documents or data,[27] but no specific consensus seems to have been reached.
Unstructured data can also simply be the knowledge that business users have about future
business trends. Business forecasting naturally aligns with the BI system because business usersthink of their business in aggregate terms. Capturing the business knowledge that may only exist
in the minds of business users provides some of the most important data points for a complete BI
solution.
Problems with semi-structured or unstructured data
There are several challenges to developing BI with semi-structured data. According to Inmon & Nesavich,[29] some of those are:
1. Physically accessing unstructured textual data – unstructured data is stored in a huge variety of formats.
2. Terminology – among researchers and analysts, there is a need to develop a standardized terminology.
3. Volume of data – as stated earlier, up to 85% of all data exists as semi-structured data. Couple that with the need for word-to-word and semantic analysis.
4. Searchability of unstructured textual data – a simple search on some data, e.g. apple, results in links where there is a reference to that precise search term. Inmon & Nesavich (2008)[29] give an example: "a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies."
The use of metadata
To solve problems with searchability and assessment of data, it is necessary to know something about the content. This can be done by adding context through the use of metadata.[26] Many systems already capture some metadata (e.g. filename, author, size, etc.), but more useful would be metadata about the actual content – e.g. summaries, topics, people or companies mentioned. Two technologies designed for generating metadata about content are automatic categorization and information extraction.