data warehouse concepts by ramesh
TRANSCRIPT
-
8/3/2019 Data Warehouse Concepts by Ramesh
8/24/2011
Ramesh Kutumbaka
-
DW Provides Insight into all Components of Enterprise Business
OLTP systems are meant for day-to-day business operations; they do not maintain historical data and are highly normalized.
You can query an operational system for information about specific instances of business objects.
For example:
You may want just the name and address of a single customer, or you may just need to look at a single invoice and the items billed on that invoice.
You do not expect a particular query to run across different databases, internal data, external data, etc.
Reasons are:
A term like "Account" may have different meanings in different systems.
We need to standardize and transform the disparate data from the various production systems, convert the data, and integrate the pieces.
-
Contd
This means that there is no conformance of data among the various operational or OLTP systems of an enterprise.
So, what do we need to do?
Building a DW/DSS/OLAP/IDS is necessary.
We don't need systems that are only pretty good at transaction processing and not pretty good at querying.
Ralph Kimball
[Figure: Decision Maker]
-
A data warehouse is an Information Delivery System (IDS) for strategic decisions. Basically, it is a Decision Support System (DSS).
What do we need to do to build the IDS/DSS/DW?
Integrate all the historical data from the various operational systems, combine this internal data with any relevant data from outside sources, and pull them together into the DW.
Resolve any conflicts in the way data resides in the different source systems, and transform, derive, and integrate the data content into a format suitable for providing information to the various categories of users.
Finally, implement the IDS.
[Figure: DW surrounded by source systems SS1 to SS8]
-
We need to have different components or building blocks.
These building blocks are arranged together in the most optimal way to serve the intended purpose.
Building blocks are arranged in a suitable architecture.
-
Bill Inmon
Bill Inmon is universally recognized as the "father of the data warehouse."
Inmon defined: "A DW is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
-
Ralph Kimball
Ralph is a leading proponent of the dimensional approach to designing large data warehouses.
A Data Warehouse is "a copy of transaction data specifically structured for query and analysis".
This definition provides less insight and depth than Mr. Inmon's, but is no less accurate.
-
Bill Inmon's paradigm:
The data warehouse is one part of the overall business intelligence system.
An enterprise has one data warehouse, and data marts source their information from the data warehouse.
-
Ralph Kimball's paradigm:
The data warehouse is the conglomerate of all data marts within the enterprise.
Information is always stored in the dimensional model.
-
There is no right or wrong between these two ideas, as they represent different data warehousing philosophies.
In reality, the data warehouse in most enterprises is closer to Ralph Kimball's idea.
This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart.
Only when more data marts are built later do they evolve into a data warehouse.
-
Sean Kelly is another leading data warehousing practitioner.
The data in the Data warehouse is:
Separate
Available
Integrated
Time Stamped
Subject Oriented
Nonvolatile
Accessible
-
For proper decision making, we need to pull together all the relevant data from the various applications.
The data in the data warehouse comes from several operational systems.
Source data are in different databases, files, and data segments.
These are disparate applications, so the operational platforms and operating systems could be different.
The file layouts, character code representations, and field naming conventions all could be different.
In addition to data from internal operational systems, for many enterprises, data from outside sources is likely to be very important, and this is one more variation in the mix of source data for a data warehouse.
[Figure: DW surrounded by source systems SS1 to SS8]
-
Integration of different Source Systems
[Figure: the Savings Account, Checking Account, and Loans Account source systems map to the subject area Account]
Across these 3 different source systems:
Naming conventions could be different.
Attributes for data items could be different.
The account number in the savings account application could be eight bytes long, but only six bytes in the checking account application.
Before moving the data into the data warehouse, you have to go through a process of transformation, consolidation, and integration of the source data.
-
Example:
Consider a bank that has several branches in several countries and millions of customers, and whose lines of business are savings and loans.
In order to store data, over the years, many application designers in each branch have made their individual decisions as to how an application and database should be built.
So source systems will differ in naming conventions, variable measurements, encoding structures, and physical attributes of data.
The following example explains how the data is integrated from source systems to target systems.
-
System Name | Attribute Name | Column Name | Datatype | Values
Source System 1 | Customer Application Date | CUSTOMER_APPLICATION_DATE | NUMERIC(8,0) | 11012005
Source System 2 | Customer Application Date | CUST_APPLICATION_DATE | DATE | 11012005
Source System 3 | Application Date | APPLICATION_DATE | DATE | 01NOV2005
In the above example, the attribute name, column name, data type, and values differ from one source system to another.
This inconsistency in data can be avoided by integrating the data into a data warehouse with good standards.
-
Target System | Attribute Name | Column Name | Datatype | Values
Record #1 | Customer Application Date | CUSTOMER_APPLICATION_DATE | DATE | 01112005
Record #2 | Customer Application Date | CUSTOMER_APPLICATION_DATE | DATE | 01112005
Record #3 | Customer Application Date | CUSTOMER_APPLICATION_DATE | DATE | 01112005
In the above example of target data, attribute names, column names, and data types are consistent throughout the target system.
This is how data from various source systems is integrated and accurately stored in the data warehouse.
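The conversion from the three source representations to the single target representation can be sketched in Python. This is only an illustration, not the deck's own method: the format strings are assumptions inferred from the sample values (11012005 read as month-day-year, 01NOV2005 as day, month abbreviation, year, parsed under an English locale), and the name to_target_date is hypothetical.

```python
from datetime import date, datetime

# Assumed input formats per source system, inferred from the sample values.
SOURCE_FORMATS = {
    "Source System 1": "%m%d%Y",  # NUMERIC(8,0), e.g. 11012005
    "Source System 2": "%m%d%Y",  # DATE shown as 11012005
    "Source System 3": "%d%b%Y",  # DATE shown as 01NOV2005
}


def to_target_date(system: str, raw) -> date:
    """Convert a source value into a real date for the target column."""
    # A numeric column drops leading zeros, so pad back to eight characters.
    text = str(raw).zfill(8)
    return datetime.strptime(text, SOURCE_FORMATS[system]).date()


source_values = [
    ("Source System 1", 11012005),
    ("Source System 2", "11012005"),
    ("Source System 3", "01NOV2005"),
]

# Render every value the way the target table shows it (day-month-year).
target_values = [to_target_date(s, v).strftime("%d%m%Y") for s, v in source_values]
print(target_values)
```

All three source values end up as the same target value, which is the consistency the target table illustrates.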
-
In Online Transaction Processing Systems (OLTPS):
We capture and store the data by individual application.
Example: Order Processing
We capture and store the data related to this particular application.
[...] customer's credit, and assigning the order for shipment.
Here, we will have data about individual orders, customers, stock status, and detailed transactions, but all of these are structured around the processing of orders.
-
In a data warehouse, data is not stored by operational application, but by business subject.
Operational Applications: Order Processing, Consumer Loans, Claims Processing, Accounts Receivable, Savings Accounts
Data Warehouse Subjects: Sales, Product, Customer, Claims, Account, Policy
-
In OLTP systems, the stored data contains the current values.
For example:
The balance is the current outstanding balance in the customer's account.
The status of an order is the current status of the order.
Of course, in OLTP systems we do store some past transactions, but essentially OLTP systems reflect current information, because these systems support day-to-day current operations.
A DW, by contrast, is a time-variant database: it supports the business community in comparing the business across different time periods.
For example, when an analyst in a grocery chain wants to promote two or more products together, that analyst wants sales of the selected products over a number of past quarters.
-
A data warehouse is a non-volatile database. Once data has entered the data warehouse, it should not change.
Data from the OLTP systems is moved into the DW at specific intervals.
Depending on the business requirements, these data movements take place twice a day, once a week, or once in [...].
In fact, in a typical DW, data movements to different data sets may take place at different frequencies.
The changes to the attributes of the product may be moved once a week.
Any revisions to the geographical setup may be moved once a month.
The units of sales may be moved once a day.
[Figure: OLTP Databases, Loads, DW; OLTP System Applications: Add, Read]
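The difference between an operational store that keeps only current values and a warehouse that is loaded at intervals and never changed afterwards can be sketched as follows. This is a minimal illustration under stated assumptions: the account id, the dates, and the names oltp_balances, warehouse_rows, and load_snapshot are all hypothetical.

```python
from datetime import date

# OLTP side: only the current value is kept; each update overwrites it.
oltp_balances = {"ACC-1": 500}

# DW side: an append-only list of (load_date, account, balance) rows.
warehouse_rows = []


def load_snapshot(load_date: date) -> None:
    """Copy the current OLTP values into the warehouse; old rows are never updated."""
    for account, balance in oltp_balances.items():
        warehouse_rows.append((load_date, account, balance))


load_snapshot(date(2011, 8, 1))
oltp_balances["ACC-1"] = 350      # operational update overwrites the balance
load_snapshot(date(2011, 8, 8))   # next scheduled load, one week later

# The warehouse still holds both values, each stamped with its load date.
history = [(d.isoformat(), b) for d, a, b in warehouse_rows if a == "ACC-1"]
print(history)
```

The operational dictionary can only answer "what is the balance now", while the warehouse rows can answer "what was the balance at each load".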
-
In an OLTP system, the data is captured at the lowest level of detail.
For example:
In an order entry system, the quantity ordered is captured and stored at the unit level of a product, per order received from the customer.
Whenever you need summary data, you add up the individual transactions.
If you are looking for units of a product ordered this month, you read all the orders entered for the entire month for that product and add them up.
Note:
We do not keep summary data in the OLTP/operational systems.
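Adding up individual transactions to get a monthly figure can be sketched like this. The order rows and the function name units_ordered are hypothetical and exist only for illustration.

```python
from datetime import date

# Hypothetical order lines as stored by an order entry system:
# (order_date, product, units)
orders = [
    (date(2011, 8, 2), "widget", 5),
    (date(2011, 8, 15), "widget", 3),
    (date(2011, 8, 20), "gadget", 7),
    (date(2011, 7, 30), "widget", 4),
]


def units_ordered(product: str, year: int, month: int) -> int:
    """Read every order for the product in the month and add up the units."""
    return sum(
        units
        for order_date, name, units in orders
        if name == product and order_date.year == year and order_date.month == month
    )


august_widgets = units_ordered("widget", 2011, 8)
print(august_widgets)
```

No summary row is stored anywhere; the total is recomputed from the detail rows each time it is needed.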
-
When a user queries the DW for analysis, he or she usually starts by looking at summary data.
The user may start with the total sale units of a product in an entire region.
Then the user may want to look at the breakdown by states in the region.
The next step may be the examination of sale units by the next level, individual stores.
Frequently, the analysis starts at a high level and moves down to lower levels of detail.
[Figure: Data Warehouse holding Detailed Data and Aggregated/Summary Data]
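The region, state, store sequence described above can be sketched as a small drill-down helper. The sample rows, the LEVELS tuple, and the function summarize are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, state, store, units sold)
sales = [
    ("East", "NY", "Store 1", 10),
    ("East", "NY", "Store 2", 15),
    ("East", "NJ", "Store 3", 5),
    ("West", "CA", "Store 4", 20),
]
LEVELS = ("region", "state", "store")


def summarize(level: str, **filters) -> dict:
    """Total units grouped by `level`, restricted to rows matching `filters`."""
    group_index = LEVELS.index(level)
    totals = defaultdict(int)
    for row in sales:
        if all(row[LEVELS.index(name)] == value for name, value in filters.items()):
            totals[row[group_index]] += row[3]
    return dict(totals)


by_region = summarize("region")                           # highest level
by_state = summarize("state", region="East")              # drill into one region
by_store = summarize("store", region="East", state="NY")  # drill into one state
print(by_region, by_state, by_store)
```

Each call narrows the filter by one level and groups by the next level down, which mirrors how the analysis moves from summary to detail.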
-
There are basically two different approaches for building a DW:
Top-down approach
Bottom-up approach
Top-down approach:
An overall DW feeds dependent data marts.
Data is extracted from the OLTP systems.
Data is transformed, cleaned, integrated, and kept in the DW.
Bottom-up approach:
Departmental or local data marts are built first.
Several departmental or local data marts are combined into a DW.
-
[Figure: Top-down Approach: Disparate Source Systems, DW, DM1, DM2, DM3. Bottom-up Approach: Disparate Source Systems, DM1, DM2, DM3, DW]
-
A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing.
A data warehouse usually contains historical data that is derived from transaction data.
It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.
In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
There are three types of data warehouses.
-
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad enterprise-wide scope, but unlike the real enterprise data warehouse, data is refreshed in near real time and used for routine business activity.
3. Data Mart - A data mart is a subset of the data warehouse, and it supports a particular region, business unit [...].
Data warehouses and data marts are built on dimensional data modeling, where fact tables are connected with dimension tables.
This is most useful for users accessing data, since a database can be visualized as a cube of several dimensions.
A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions.
-
A data mart is a subset of the data warehouse that is designed for a particular line of business, such as sales, marketing, or finance.
In a dependent data mart, data can be derived from an enterprise-wide data warehouse.
[Figure: DW; DM1, DM2, DM3, DM4]
In an independent data mart, data can be collected directly from sources.
[Figure: DW; DM1, DM2, DM3, DM4]
-
Building Blocks or Components of DW Architecture
[Figure: Source Data (Production Data, Internal Data, Archived Data, External Data); Data Staging; Data Storage (DW, DM1, DM2, Multi-Dim DBs, DBMS); Metadata; Information Delivery (Report/Query, OLAP, Data Mining)]
Architecture is the proper arrangement of the components.
-
Source data coming to the DW may be grouped into four broad categories, as shown in the previous slide.
Production Data
External Data
Internal Data
Archived Data
-
Most executives depend on data from external sources for a high percentage of the information they use.
They use statistics relating to their industry produced by external agencies.
They use market share data of competitors.
They use standard values of financial indicators for their business to check on their performance.
-
The DW of a car rental company contains data on the current production schedules of the leading automobile manufacturers. This external data in the DW helps the car rental company plan its fleet management.
The purpose served by such external data sources cannot be fulfilled by the data available within the organization.
Usually, data from outside sources does not conform to your formats.
You have to devise conversions of the data into your internal formats and data types.
Some sources may provide information at regular, stipulated intervals, or may give you data on request.
You need to accommodate these variations.
-
Production Data
This type of data comes from the various OLTP or operational systems of the enterprise.
While dealing with this data, you come across many variations in the data formats.
You also notice that the data resides on different hardware platforms.
The data is supported by different database systems and operating systems.
This is data from many vertical applications.
The significant and disturbing characteristic of production data is disparity.
We need to standardize and transform the disparate data from the various production systems, convert the data, and integrate the pieces into useful data for storage in the DW.
-
The following data is internal data, parts of which may be required in the DW:
Private spreadsheets
Documents
Customer profiles
Sometimes even departmental databases
-
OLTP or operational systems are primarily intended to run the current business.
In OLTP or operational systems, old data is periodically taken out and stored in archived files.
A DW keeps historical snapshots of data for analysis over time.
To get historical data, you need to connect to the archived data sets.
Depending on the business requirements, you have to include sufficient historical data in the DW.
This type of data is useful for discerning patterns and for analysis.
-
The data extracted from the various disparate source systems and from external sources needs to be changed, converted, combined, deduplicated, and made ready in a format that is suitable to be stored for querying and analysis.
There are three major functions that need to be performed to get the data ready in the Staging Area (SA):
Extract the data from source systems
Transform the data
Load the data into the DW
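The three staging functions can be sketched as three small Python functions. This is a toy illustration, not a description of any particular ETL tool: the two source extracts, their field names, and the eight-character key length are assumptions chosen to show renaming, conversion, and deduplication.

```python
# Hypothetical extracts from two source systems with different field names.
savings_rows = [{"CUST_ID": "00001234", "CUST_NAME": "alice smith "}]
checking_rows = [
    {"cust": "1234", "name": "Alice Smith"},  # same customer, shorter key
    {"cust": "5678", "name": "BOB JONES"},
]


def extract():
    """Yield raw rows tagged with the system they came from."""
    for row in savings_rows:
        yield "savings", row
    for row in checking_rows:
        yield "checking", row


def transform(extracted):
    """Standardize field names, key length, and name casing; drop duplicates."""
    seen = set()
    for system, row in extracted:
        if system == "savings":
            customer_id, name = row["CUST_ID"], row["CUST_NAME"]
        else:
            customer_id, name = row["cust"], row["name"]
        record = (customer_id.zfill(8), name.strip().title())
        if record not in seen:
            seen.add(record)
            yield {"customer_id": record[0], "customer_name": record[1]}


def load(records, warehouse):
    """Append the prepared records to the warehouse table (a list here)."""
    warehouse.extend(records)


warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

Because the functions are generators chained together, each row flows through extract, transform, and load in turn, and the duplicate customer from the second system is dropped before loading.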
-
The data storage for the DW is a separate repository.
The data in the DW is kept in structures suitable for analysis.
In a DW, any of the following data models can be used:
Star Schema
Snowflake Schema
Starflake Schema
DWs are read-only data repositories.
-
[Figure: Information Delivery System: Ad hoc Reports, Complex Queries, MD Analysis, Statistical Analysis, EIS Feed, Data Mining; delivery channels: Online, Intranet, Internet, E-mail]
The IDS component includes different methods of information delivery.
Ad hoc reports are predefined reports primarily meant for novice and casual users.
There is provision for complex queries and multidimensional (MD) analysis.
Statistical analysis caters to the needs of business analysts and power users.
Information fed into Executive Information Systems (EIS) is meant for senior executives and high-level managers.
Some DWs also provide data to data-mining applications, which are knowledge discovery systems where the mining algorithms help you discover trends and patterns from the usage of your data.
-
Metadata in a DW is similar to the Data Dictionary (DD) or the catalog in a Database Management System.
In the DD, we keep information about:
Logical data structures
Files and addresses
Indexes, and so on
The DD contains data about the data in the database.
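As a concrete instance of "data about the data", the sketch below reads the catalog of an in-memory SQLite database using Python's standard sqlite3 module. SQLite is used only because it ships with Python; the table and index names are hypothetical, and the catalog of a real warehouse DBMS will look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_customer_name ON customer (name)")

# sqlite_master is SQLite's catalog: one row per table, index, view, or trigger.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")]

print(catalog)
print(columns)
```

Neither query touches any customer data; both return only descriptions of structures and indexes, which is what a data dictionary holds.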
-
Contd
Types of Metadata:
Operational Metadata
Extraction and Transformation Metadata
End-User Metadata
-
Example
In general, an organization is started to earn money by selling a product or by providing a service for the product. An organization may be at one place or may have several branches.
[...] dimensions are product, location, time, and organization.
Dimension tables are explained in detail under the section Dimensions. With this example, we will try to provide a detailed explanation of the STAR SCHEMA.
-
Star schema is a relational database schema for representing multidimensional data.
It is the simplest form of data warehouse schema, containing one or more dimensions and fact tables.
It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star, where one fact table is connected to multiple dimensions.
The center of the star schema consists of a large fact table, which points towards the dimension tables.
The advantages of the star schema are slicing down, increased performance, and easy understanding of data.
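A star schema along these lines can be sketched with Python's built-in sqlite3 module. The table and column names and the sample rows are assumptions made for illustration, and the organization dimension from the running example is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, month TEXT, year INTEGER);

-- Fact table at the center: one foreign key per dimension plus the measure.
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim,
    location_key INTEGER REFERENCES location_dim,
    time_key     INTEGER REFERENCES time_dim,
    sales_dollar REAL,
    PRIMARY KEY (product_key, location_key, time_key)
);

INSERT INTO product_dim  VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
INSERT INTO location_dim VALUES (1, 'Albany', 'NY'), (2, 'Newark', 'NJ');
INSERT INTO time_dim     VALUES (1, 'Jan', 2011), (2, 'Feb', 2011);
INSERT INTO sales_fact   VALUES (1, 1, 1, 100.0), (1, 2, 1, 50.0), (2, 1, 2, 200.0);
""")

# Slice the data: sales dollars by product category for one state.
ny_sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_dim p  ON p.product_key  = f.product_key
    JOIN location_dim l ON l.location_key = f.location_key
    WHERE l.state = 'NY'
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(ny_sales_by_category)
```

Every dimension is exactly one join away from the fact table, which is why queries against a star schema stay short and easy to read.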
-
1) Identify a business process for analysis (like sales).
2) Identify measures or facts (sales dollar).
3) Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
4) List the columns that describe each dimension (region name, branch name).
5) Determine the lowest level of summary in a fact table (sales dollar).
-
Important aspects of Star Schema & Snowflake Schema
1) In a star schema, every dimension will have a primary key.
2) In a star schema, a dimension table will not have any parent table.
3) Whereas in a snowflake schema, a dimension table will have one or more parent tables.
4) Hierarchies for the dimensions are stored in the dimension table itself in a star schema.
5) Whereas hierarchies are broken into separate tables in a snowflake schema. These hierarchies help to drill down the data from the topmost hierarchies to the lowermost hierarchies.
-
Hierarchy
A logical structure that uses ordered levels as a means of organizing data.
A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, and from the Quarter level to the Year level.
A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the [...].
Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.
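Aggregating along the Month, Quarter, Year hierarchy can be sketched as repeated roll-ups, where each level maps a key to its parent. The monthly figures and the helper names to_quarter and roll_up are hypothetical.

```python
from collections import defaultdict

# Hypothetical monthly unit sales keyed by (year, month number).
monthly = {(2011, 1): 10, (2011, 2): 20, (2011, 3): 30, (2011, 4): 40, (2010, 12): 5}


def to_quarter(year_month):
    """Map a (year, month) key to its parent (year, quarter) key."""
    year, month = year_month
    return (year, (month - 1) // 3 + 1)


def roll_up(data, parent_of):
    """Aggregate values from one level of the hierarchy to the next level up."""
    totals = defaultdict(int)
    for key, value in data.items():
        totals[parent_of(key)] += value
    return dict(totals)


quarterly = roll_up(monthly, to_quarter)                      # Month -> Quarter
yearly = roll_up(quarterly, lambda year_quarter: year_quarter[0])  # Quarter -> Year
print(quarterly, yearly)
```

Reading the same mappings in the opposite direction gives the drill path from Year down to Quarter and Month.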
-
Fact Table
A table in a star schema that contains facts and is connected to dimensions.
A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables.
The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables).
A fact table usually contains facts with the same level of aggregation.
-
A snowflake schema is a term that describes a star schema structure normalized through the use of outrigger tables, i.e., dimension table hierarchies are broken into simpler tables.
In the star schema example we had 4 dimensions (location, product, time, organization) and a fact table (sales).
In the snowflake schema, the example diagram has 4 dimension tables, 4 lookup tables, and 1 fact table.
The reason is that the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and shown separately.
In OLAP, the snowflake schema approach increases the number of joins and leads to poorer performance in the retrieval of data.
A few organizations try to normalize the dimension tables to save space. Since dimension tables take up relatively little space, the snowflake schema approach may be avoided.
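The extra join that snowflaking introduces can be seen in a small sqlite3 sketch. Only the product dimension and its category hierarchy are modeled, and all table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the category name lives inside the product dimension.
CREATE TABLE product_star (product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);

-- Snowflake: the category hierarchy is broken out into a lookup table.
CREATE TABLE category_lookup (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_snow (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES category_lookup
);

CREATE TABLE sales_fact (product_key INTEGER, sales_dollar REAL);

INSERT INTO product_star    VALUES (1, 'Pen', 'Stationery'), (2, 'Pencil', 'Stationery');
INSERT INTO category_lookup VALUES (10, 'Stationery');
INSERT INTO product_snow    VALUES (1, 'Pen', 10), (2, 'Pencil', 10);
INSERT INTO sales_fact      VALUES (1, 100.0), (2, 40.0);
""")

star_sql = """
    SELECT p.category_name, SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_star p ON p.product_key = f.product_key
    GROUP BY p.category_name"""

snowflake_sql = """
    SELECT c.category_name, SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_snow p    ON p.product_key  = f.product_key
    JOIN category_lookup c ON c.category_key = p.category_key
    GROUP BY c.category_name"""

star_result = conn.execute(star_sql).fetchall()
snowflake_result = conn.execute(snowflake_sql).fetchall()
print(star_result, snowflake_result)
```

Both queries return the same totals, but the snowflake version needs one more join for every hierarchy level that has been moved into its own table, while the star version stores the category name redundantly on each product row.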
-
Additive - Measures that can be added across all dimensions.
Non-Additive - Measures that cannot be added across any dimension.
Semi-Additive - Measures that can be added across some dimensions but not others.
In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called factless fact tables.
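The three kinds of measures can be illustrated with a few rows of made-up data. Deposits, balances, and margin percentages are common textbook examples chosen for this sketch; they do not come from the slides.

```python
# Hypothetical daily fact rows: (day, account, deposit_amount, end_of_day_balance)
facts = [
    ("Mon", "A", 100, 100),
    ("Mon", "B", 50, 50),
    ("Tue", "A", 20, 120),
    ("Tue", "B", 0, 50),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(deposit for _, _, deposit, _ in facts)

# Semi-additive: balances add up across accounts for a single day ...
balance_on_tue = sum(balance for day, _, _, balance in facts if day == "Tue")
# ... but summing one account's balance across days counts the same money twice.
misleading_sum_for_a = sum(balance for _, account, _, balance in facts if account == "A")

# Non-additive: a ratio such as margin percentage cannot be summed at all;
# it has to be recomputed from its additive parts (revenue and profit).
revenue_profit = [(100, 10), (300, 90)]  # margins of 10% and 30%
overall_margin = (
    sum(profit for _, profit in revenue_profit)
    / sum(revenue for revenue, _ in revenue_profit)
)

print(total_deposits, balance_on_tue, misleading_sum_for_a, overall_margin)
```

Account A never held more than 120, so the across-days sum of 220 is meaningless, and the overall margin is 25%, not the 40% obtained by adding the two percentages.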
-
1) Identify a business process for analysis (like sales).
2) Identify measures or facts (sales dollar).
3) Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
4) List the columns that describe each dimension (region name, branch name).
5) Determine the lowest level of summary in a fact table (sales dollar).