CH#3, By: Babu Ram Dawadi

3. Data Warehouse Architecture

3.1 System Process

3.2 Process Architecture


System Process

Data warehouses are built to support large data volumes (above 100 GB of database) cost-effectively.

A data warehouse must be architected to support three major driving factors:
• Populating the warehouse
• Day-to-day management of the warehouse
• The ability to cope with requirements evolution

The processes required to populate the warehouse focus on extracting the data, cleaning it up and making it available for analysis.

Typical Process Flow

Before we create an architecture for a data warehouse, we must first understand the major processes that constitute a data warehouse.

The processes are:
• Extract and load the data.
• Clean and transform the data into a form that can cope with large volumes and provide good query performance.
• Back up and archive the data.
• Manage queries, and direct them to the appropriate data sources.

Process Flow Within a DW

[Figure: Source -> Extract & Load -> Data Warehouse (data transformation and movement) -> Query -> Users]

Extract & Load Process

Data extraction takes data from the source systems and makes it available to the data warehouse.

Data load takes the extracted data and loads it into the DW.

When we extract data from a physical database, whatever form it is held in, the original information content will have been modified and extended over the years.

Before the data is loaded into the DW, the information content must be reconstructed.

The DW extract & load process must take the data and add context and meaning, in order to convert it into value-adding business information.

Process Flow… Extract & Load

Process controlling
• The mechanisms that determine when to start extracting the data, when to run the transformations and consistency checks, and so on, are very important.
• To ensure that the various tools, logic modules and programs are executed in the correct sequence and at the correct time, a controlling mechanism is required to fire each module when appropriate.

Initiate extraction
• Data should be in a consistent state when it is extracted from the source system.
• The information in a data warehouse represents a snapshot of corporate information, so that the user is looking at a single, consistent version of the truth.
• Guideline: start extracting data from a data source only when it represents the same snapshot of time as all the other data sources.
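The controlling mechanism and the snapshot guideline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `same_snapshot` and `run_pipeline`, the step names and the snapshot dates are all hypothetical and do not come from any particular warehouse tool.

```python
from typing import Callable, Dict, List, Tuple

# (step name, callable that returns True on success); a hypothetical shape
Step = Tuple[str, Callable[[], bool]]

def same_snapshot(source_times: Dict[str, str]) -> bool:
    """True when every source reports the same snapshot time."""
    return len(set(source_times.values())) == 1

def run_pipeline(steps: List[Step]) -> List[str]:
    """Fire each step in order; stop at the first step that reports failure."""
    completed = []
    for name, action in steps:
        if not action():
            break  # later steps depend on this one, so they are not fired
        completed.append(name)
    return completed

# Hypothetical sources and steps; real steps would call extract tools,
# loaders and consistency checks.
sources = {"orders_db": "1997-07-01", "billing_db": "1997-07-01"}
steps = [
    ("extract", lambda: same_snapshot(sources)),
    ("load_to_temporary_store", lambda: True),
    ("consistency_checks", lambda: False),  # simulate a failed check
    ("transform_into_warehouse", lambda: True),
]
print(run_pipeline(steps))  # ['extract', 'load_to_temporary_store']
```

Stopping at the first failure keeps unchecked data out of the warehouse; a production controller would also need scheduling, logging and restart handling, which are omitted here.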

Process Flow… Extract & Load

[Figure: Extraction]

Process Flow… Extract & Load

Loading the data
• Once the data has been extracted from the source systems, it is typically loaded into a temporary data store so that it can be cleaned up and made consistent.
• Guideline: do not execute consistency checks until all the data sources have been loaded into the temporary data store.
• Once the data in the temporary data store has been cleaned up, it is transformed and moved into the warehouse by the warehouse manager.
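The load sequence above (load every source into a temporary store, check, then move clean rows into the warehouse) can be sketched with Python's built-in `sqlite3` module. SQLite stands in for the warehouse database, and the table names `stage_sale` and `sale`, the sample rows and the single "customer must be present" check are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stage_sale (orderId TEXT, custId INTEGER, amt INTEGER);
    CREATE TABLE sale       (orderId TEXT, custId INTEGER, amt INTEGER);
""")

# Two hypothetical source extracts; the last row has no customer.
source_a = [("o100", 53, 12), ("o102", 53, 11)]
source_b = [("o105", 111, 50), ("o106", None, 7)]

# 1. Load every source into the temporary store first.
for rows in (source_a, source_b):
    conn.executemany("INSERT INTO stage_sale VALUES (?, ?, ?)", rows)

# 2. Only now run the consistency check, across all loaded sources.
bad = conn.execute(
    "SELECT orderId FROM stage_sale WHERE custId IS NULL").fetchall()

# 3. Move the clean rows from the temporary store into the warehouse table.
conn.execute(
    "INSERT INTO sale SELECT * FROM stage_sale WHERE custId IS NOT NULL")
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM sale").fetchone()[0]
print(bad, loaded)  # [('o106',)] 3
```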

Process Flow… Clean & Transform

This is the system process that takes the loaded data and structures it for query performance and for minimizing operational costs.

The process steps for cleaning and transforming are:
• Clean and transform the loaded data into a structure that speeds up queries.
• Partition the data in order to speed up queries, optimize hardware performance and simplify the management of the DW.
• Create aggregations to speed up the common queries.

Process Flow… Clean & Transform

Data needs to be cleaned and checked in the following ways:
• Make sure the data is consistent within itself.
• Make sure the data is consistent with other data within the same source.
• Make sure the data is consistent with data in the other source systems.
• Make sure the data is consistent with the information already in the DW.
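The four kinds of check listed above can be sketched as one Python function. The concrete rules (amount equals quantity times price, unique order ids, known customers, no reloading of an order) and the name `check_record` are invented for illustration; a real warehouse would define its own rules.

```python
def check_record(rec, same_source_ids, other_source_customers, warehouse_order_ids):
    """Return a list of consistency problems found for one record."""
    problems = []
    # 1. Consistent within itself: amount must equal qty * unit price.
    if rec["amt"] != rec["qty"] * rec["price"]:
        problems.append("amount does not match qty * price")
    # 2. Consistent with other data in the same source: order id is unique.
    if same_source_ids.count(rec["orderId"]) > 1:
        problems.append("duplicate order id in source")
    # 3. Consistent with the other source systems: customer must exist there.
    if rec["custId"] not in other_source_customers:
        problems.append("unknown customer")
    # 4. Consistent with the DW: the order must not already be loaded.
    if rec["orderId"] in warehouse_order_ids:
        problems.append("order already in warehouse")
    return problems

# Hypothetical sample data.
same_source_ids = ["o200", "o100", "o100"]
customers = {53, 81, 111}
in_warehouse = {"o100"}

rec_ok = {"orderId": "o200", "custId": 53, "qty": 2, "price": 5, "amt": 10}
rec_bad = {"orderId": "o100", "custId": 99, "qty": 2, "price": 5, "amt": 11}

print(check_record(rec_ok, same_source_ids, customers, in_warehouse))   # []
print(len(check_record(rec_bad, same_source_ids, customers, in_warehouse)))  # 4
```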

Process Flow… Backup & Archive

As in operational systems, the data within the data warehouse is backed up regularly to ensure that the DW can always be recovered from data loss, software failure or hardware failure.

In archiving, older data is removed from the system in a format that allows it to be quickly restored if required.

Process Flow… Query Management

The query management process is the system process that manages the queries and speeds them up by directing them to the most effective data source.

The query management process may also be required to monitor the actual query profiles.

Unlike the other system processes, query management does not generally operate during the load of information into the DW.

The query management facilities are:

Directing queries
• The query management process determines which table delivers the answer most effectively, by calculating which table would satisfy the query in the shortest time.
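One way to picture query direction is a function that picks, among the tables able to answer a query, the one expected to be fastest. The sketch below uses row count as a stand-in for response time, which is an assumption; the function name `direct_query`, the table names and the row counts are hypothetical.

```python
def direct_query(needed_columns, tables):
    """Pick the smallest table that contains every column the query needs."""
    candidates = [t for t in tables if needed_columns <= t["columns"]]
    if not candidates:
        raise LookupError("no table can answer this query")
    # Assumption: fewer rows to scan means a shorter response time.
    return min(candidates, key=lambda t: t["rows"])["name"]

# Hypothetical catalogue: one detailed fact table and two summary tables.
tables = [
    {"name": "sale_detail",
     "columns": {"date", "custId", "prodId", "storeId", "qty", "amt"},
     "rows": 10_000_000},
    {"name": "sale_by_store_month",
     "columns": {"month", "storeId", "amt"}, "rows": 3_600},
    {"name": "sale_by_product_month",
     "columns": {"month", "prodId", "amt"}, "rows": 24_000},
]

print(direct_query({"storeId", "amt"}, tables))  # sale_by_store_month
print(direct_query({"custId", "amt"}, tables))   # sale_detail
```

A query on store totals is sent to the small summary table, while a query that needs customer detail falls back to the detailed fact table.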

Process Flow… Query Management

Query management facilities (contd.)

Maximizing system resources
• Regardless of the processing power available to run the DW, it is all too possible for a single large query to soak up all system resources, affecting the performance of the entire system.
• The query management process must ensure that no single query can affect the overall system performance.

Query capture
• Users are exploiting the information content of the DW, which implies that query profiles change on a regular basis over the life of a DW.
• At various points in time, such as the end of the week, these queries can be analyzed to capture new query profiles and their impact on summary tables.
• Query capture is typically part of the query management process.
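Query capture can be sketched as counting how often each query profile appears in a log and flagging the frequent ones as candidates for summary tables. The log format, the name `capture_profiles` and the threshold of two occurrences are assumptions made for illustration.

```python
from collections import Counter

def capture_profiles(query_log):
    """Count how often each (table, grouping columns) profile was requested."""
    return Counter(
        (q["table"], tuple(sorted(q["group_by"]))) for q in query_log
    )

# Hypothetical log of one week's queries.
week_log = [
    {"table": "sale", "group_by": ["storeId"]},
    {"table": "sale", "group_by": ["storeId"]},
    {"table": "sale", "group_by": ["prodId", "storeId"]},
    {"table": "sale", "group_by": ["storeId", "prodId"]},
    {"table": "sale", "group_by": ["storeId"]},
]

profiles = capture_profiles(week_log)
# Profiles seen at least twice are candidates for a summary table.
candidates = [p for p, n in profiles.most_common() if n >= 2]
print(candidates)
```

Sorting the grouping columns makes `["prodId", "storeId"]` and `["storeId", "prodId"]` count as the same profile.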

Process Architecture

The system processes describe the major processes that constitute a data warehouse.

The process architecture now outlines a complete data warehouse architecture that encompasses these processes.

The complexity of each manager in a data warehouse will vary from DW to DW.

Three Data Warehouse Models

Enterprise warehouse
• Collects all of the information about subjects spanning the entire organization.

Data mart
• A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
• Independent vs. dependent (sourced directly from the warehouse) data marts.

Virtual warehouse
• A set of views over operational databases.
• Only some of the possible summary views may be materialized.

Process Architecture

Components of DW architecture:
• Load manager
• Warehouse manager
• Query manager
• Detailed information
• Summary information
• Meta data
• Data marting

Process Architecture

[Figure: DW process architecture. Operational data and external data enter through the load manager; the warehouse manager maintains the detailed information, summary information and meta data; the query manager serves OLAP tools. The flow is labelled Data -> Information -> Decision.]

Process Arch… Load Manager

The load manager is the system component that performs all the operations necessary to support the extract and load process.

This system may be constructed using a combination of off-the-shelf tools, C programs and shell scripts.

The size and complexity of the load manager will vary from DW to DW.

The effort to develop the load manager should be planned within the first production phase.

The architecture of the load manager is such that it performs the following operations:
• Extract the data from the source systems.
• Fast-load the extracted data into a temporary data store.
• Perform simple transformations into a structure similar to the one in the DW.

Process Arch… Load Manager

[Figure: Load manager architecture. Components: controlling process, stored procedures, copy management tool, fast loader, file structure, temporary data store, warehouse structure.]

Process Arch… Load Manager

Extract data from source
• In order to get hold of the source data, it has to be transferred from the source systems and made available to the DW.

Fast load
• Data should be loaded into the warehouse in the fastest possible time, in order to minimize the total load window.
• The speed at which the data is processed into the warehouse is affected by the kind of transformations that are taking place.
• In practice, it is more effective to load the data into a relational database prior to applying transformations and checks.

Simple transformation
• Before or during the load there will be an opportunity to perform simple transformations on the data.

Process Arch… Warehouse Manager

The warehouse manager is the system component that performs all the operations necessary to support the warehouse management process.

This system is typically constructed using a combination of third-party systems management software, C programs and shell scripts.

The architecture of the warehouse manager is such that it performs the following operations:
• Analyze the data to perform consistency and referential integrity checks.
• Transform and merge the source data in the temporary data store into the DW.
• Generate denormalizations if appropriate.
• Make a total backup of the data within the DW.

Process Arch… Warehouse Manager

[Figure: Warehouse manager architecture. Components: controlling process, stored procedures, backup/recovery tools, SQL scripts, temporary data store, warehouse structure, schema.]

Guideline: do not load data directly into the DW tables until it has been cleaned up. Use temporary tables that emulate the structures within the DW.

Process Arch… Warehouse Manager

Create indexes & views
• The warehouse manager has to create indexes against the information in the fact or dimension tables.
• With a large number of rows, the overhead of inserting each row into both the table and its indexes can be higher than the overhead of recreating the indexes once the rows have been inserted.
• Therefore it is more effective to drop all indexes against a table prior to inserting a large number of rows.
• The fact tables are large, so the warehouse manager creates views that combine a number of partitions into a single fact table.
• It is suggested that we create a few views, corresponding to meaningful periods of time within the business.
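The two techniques above can be sketched with Python's built-in `sqlite3` module: drop the index before a bulk insert and rebuild it afterwards, then expose several partitions through one view. SQLite is only a stand-in for the warehouse database; the table, index and view names and the quarterly partitioning are assumptions, and whether drop-and-rebuild is actually faster depends on the database and the batch size.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale_1997_q1 (orderId TEXT, storeId TEXT, amt INTEGER);
    CREATE TABLE sale_1997_q2 (orderId TEXT, storeId TEXT, amt INTEGER);
    CREATE INDEX idx_q1_store ON sale_1997_q1 (storeId);
""")

# Hypothetical batch of 1000 rows to load into the first-quarter partition.
new_rows = [(f"o{i}", f"c{i % 3 + 1}", i) for i in range(1000)]

# Drop the index, insert the batch, then rebuild the index once.
conn.execute("DROP INDEX idx_q1_store")
conn.executemany("INSERT INTO sale_1997_q1 VALUES (?, ?, ?)", new_rows)
conn.execute("CREATE INDEX idx_q1_store ON sale_1997_q1 (storeId)")

# A view that presents two partitions as a single fact table for one
# meaningful business period (here, the first half of 1997).
conn.execute("""
    CREATE VIEW sale_1997_h1 AS
        SELECT * FROM sale_1997_q1
        UNION ALL
        SELECT * FROM sale_1997_q2
""")
conn.execute("INSERT INTO sale_1997_q2 VALUES ('o2000', 'c1', 5)")
conn.commit()

total_rows = conn.execute("SELECT COUNT(*) FROM sale_1997_h1").fetchone()[0]
print(total_rows)  # 1001
```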

Process Arch… Warehouse Manager

Generate the summaries
• Summary information is necessary in any organization because higher-level officers do not want to see the detailed information.
• The summary information helps them in decision making.
• Summaries are generated automatically by the warehouse manager; that is, summary generation is executed every time data is loaded.
• The actual generation of summaries is achieved through the use of embedded SQL in either stored procedures (triggers) or C programs.
• A command sequence such as: Create table {…} as select {….} from {….} where {…..}
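A concrete instance of the "create table as select" pattern can be sketched with Python's built-in `sqlite3` module. The summary table name `sale_by_store`, the `refresh_summary` helper, the `WHERE amt > 0` filter and the sample rows are assumptions chosen for illustration; a real warehouse manager would run the equivalent statement after each load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?)",
    [("p1", "c1", 12), ("p2", "c1", 11), ("p1", "c3", 50), ("p2", "c2", 8)],
)

def refresh_summary(conn):
    """Rebuild the per-store summary table from the detailed fact table."""
    conn.execute("DROP TABLE IF EXISTS sale_by_store")
    conn.execute("""
        CREATE TABLE sale_by_store AS
            SELECT storeId, SUM(amt) AS total_amt
            FROM sale
            WHERE amt > 0
            GROUP BY storeId
    """)

# Called after every load, so the summary always matches the detail.
refresh_summary(conn)
summary = conn.execute(
    "SELECT storeId, total_amt FROM sale_by_store ORDER BY storeId"
).fetchall()
print(summary)  # [('c1', 23), ('c2', 8), ('c3', 50)]
```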

Process Arch… Query Manager

The query manager is the system component that performs all the operations necessary to support the query management process.

The architecture of a query manager is such that it performs the following operations:
• Direct queries to the appropriate tables.
• Schedule the execution of user queries.

The query manager also stores query profiles to allow the warehouse manager to determine which indexes are appropriate.

Process Arch… Query Manager

[Figure: Query manager architecture. Components: query redirection via C tools or RDBMS; stored procedures (generate views); query management tools; query scheduling via C tool or RDBMS. Data areas: meta data, detailed information, summary information.]

Process Arch… Detailed Information

This is the area of the data warehouse that stores all the detailed information in the starflake schema.

The detailed information is not necessarily all held online the whole time: it can be aggregated to the next level of detail, after which the detailed information is offloaded to the tape archive.

If the business requirement for detailed information is weak or very specific, it may be possible to satisfy it by storing a rolling three-month detailed history.

Guideline: determine which business activities require detailed transaction information, in order to determine the level at which to retain detailed information in the DW.

If the detailed information is being stored offline to minimize disk storage requirements, make sure that the data has been extracted, cleaned up and transformed into a starflake schema prior to archiving it.

Process Arch… Detailed Information

[Figure: DW process architecture, as before (load manager, warehouse manager, query manager, detailed information, summary information, meta data, operational data, external data, OLAP tools), with the additional label "Detailed info. in archived data".]

Process Arch… Detailed Information

Detailed information can be managed under the following topics:
• Data warehouse schemas
• Fact data
• Dimension data
• Partitioning data

Star

customer
custId   name    address     city
53       joe     10 main     sfo
81       fred    12 main     sfo
111      sally   80 willow   la

product
prodId   name   price
p1       bolt   10
p2       nut    5

store
storeId   city
c1        nyc
c2        sfo
c3        la

sale
orderId   date     custId   prodId   storeId   qty   amt
o100      1/7/97   53       p1       c1        1     12
o102      2/7/97   53       p2       c1        2     11
105       3/8/97   111      p1       c3        5     50

Star Schema

[Figure: star schema. The fact table sale (orderId, date, custId, prodId, storeId, qty, amt) is linked to the dimension tables customer (custId, name, address, city), product (prodId, name, price) and store (storeId, city).]
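The star schema and the sample rows from the Star slide can be tried out with Python's built-in `sqlite3` module. The column types, the primary and foreign key declarations and the example query are assumptions; the slides give only the column names and the sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables
    CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                           address TEXT, city TEXT);
    CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
    CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
    -- Fact table: one row per sale, pointing at each dimension
    CREATE TABLE sale (
        orderId TEXT PRIMARY KEY, date TEXT,
        custId  INTEGER REFERENCES customer(custId),
        prodId  TEXT REFERENCES product(prodId),
        storeId TEXT REFERENCES store(storeId),
        qty INTEGER, amt INTEGER);

    INSERT INTO customer VALUES (53, 'joe', '10 main', 'sfo'),
                                (81, 'fred', '12 main', 'sfo'),
                                (111, 'sally', '80 willow', 'la');
    INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
    INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
    INSERT INTO sale VALUES ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
                            ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
                            ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# A typical star join: total sales amount per store city and product name.
rows = conn.execute("""
    SELECT st.city, p.name, SUM(s.amt)
    FROM sale s
    JOIN store st  ON st.storeId = s.storeId
    JOIN product p ON p.prodId  = s.prodId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()
print(rows)  # [('la', 'bolt', 50), ('nyc', 'bolt', 12), ('nyc', 'nut', 11)]
```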

Cube

Fact table view:

sale
prodId   storeId   amt
p1       c1        12
p2       c1        11
p1       c3        50
p2       c2        8

Multi-dimensional cube (dimensions = 2):

      c1   c2   c3
p1    12        50
p2    11   8
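The step from the fact table view to the two-dimensional cube can be sketched in plain Python. The function name `to_cube` and the dictionary representation are assumptions; the data are the four rows from the slide.

```python
fact = [("p1", "c1", 12), ("p2", "c1", 11), ("p1", "c3", 50), ("p2", "c2", 8)]

def to_cube(fact_rows):
    """Map each (prodId, storeId) cell to the total amount sold."""
    cube = {}
    for prod, store, amt in fact_rows:
        cube[(prod, store)] = cube.get((prod, store), 0) + amt
    return cube

cube = to_cube(fact)
products = sorted({p for p, _ in cube})
stores = sorted({s for _, s in cube})

# Print one row per product; None marks an empty cell.
for p in products:
    print(p, [cube.get((p, s)) for s in stores])
```

The empty cells (p1 at c2, p2 at c3) are simply absent from the dictionary, which mirrors the blank cells in the cube on the slide.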

A Sample Data Cube

[Figure: a three-dimensional data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A., Canada, Mexico, sum). One cell is annotated "Total annual sales of TV in U.S.A." Values shown in the figure: 200, 150, 63, 37 and 450.]

Process Arch… Summary Information

Summary information is essentially a replication of information already in the data warehouse.

The implications of summary data are that the data:
• Exists to speed up the performance of common queries
• Increases operational cost
• May have to be updated every time new data is loaded into the DW
• May not have to be backed up, because it can be generated fresh from the detailed information

Because the volume of data that needs to be scanned is an order of magnitude smaller, query performance improves by an order of magnitude.

On the negative side, there is an increase in operational cost for creating and updating the summary tables on a daily basis.

Guideline 1: avoid creating more than 200 centralized summary tables on an ongoing basis.

Process Arch… Summary Information

Summary information (contd.)
Guideline 2: inform users that summary tables accessed infrequently will be dropped on an ongoing basis.

Metadata

Data Marting
• A data mart is a subset of the information content of a DW that is stored in its own database, summarized or in detail.
• Data marting can improve query performance simply by reducing the volume of data that needs to be scanned to satisfy a query.
• Data marts are created along functional or departmental lines, in order to exploit a natural break in the data.
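A data mart held in its own database can be sketched with Python's built-in `sqlite3` module by attaching a second database and copying a subset of the warehouse into it. The schema name `mart`, the table name `sale_c1` and the choice of "store c1" as the departmental split are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A second, separate in-memory database plays the role of the data mart.
conn.execute("ATTACH DATABASE ':memory:' AS mart")

conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?)",
    [("p1", "c1", 12), ("p2", "c1", 11), ("p1", "c3", 50), ("p2", "c2", 8)],
)

# The mart holds only the rows of interest to one group of users.
conn.execute("""
    CREATE TABLE mart.sale_c1 AS
        SELECT prodId, amt FROM sale WHERE storeId = 'c1'
""")
conn.commit()

mart_rows = conn.execute(
    "SELECT prodId, amt FROM mart.sale_c1 ORDER BY prodId").fetchall()
print(mart_rows)  # [('p1', 12), ('p2', 11)]
```

Queries against `mart.sale_c1` scan two rows instead of four; on realistic volumes, that reduction in scanned data is where the performance gain comes from.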

Multi-Tiered Architecture

[Figure: multi-tiered architecture. Data sources (operational DBs, other sources) -> extract, transform, load, refresh, with monitor & integrator and metadata -> data storage (data warehouse, data marts) -> OLAP server (OLAP engine) -> front-end tools (analysis, query, reports, data mining).]