data warehouse notes_sk


Upload: shiva-ch

Post on 13-May-2017


Page 1: Data Warehouse Notes_SK

Data Warehouse

Business Intelligence
Combination of technologies like

- Data Warehousing (DW)
- On-Line Analytical Processing (OLAP)
- Data Mining (DM)
- Data Visualization (VIS)
- Decision Analysis (what-if)
- Customer Relationship Management (CRM)

Operational Data
- Presents a dynamic view of the business
- Must be kept up-to-date and current at all times
- Updated by transactions entered by data-entry operators or specially trained end users
- Is maintained in detail
- Utilization is predictable. Systems can be optimized for projected workloads
- High volume of transactions, each of which affects a small portion of the data
- Users do not need to understand data structures
- Functional orientation

Analytical Data
- Presents a static view of the business
- End-user access is usually read-only
- More concerned with summary information
- Usage is unpredictable in terms of depth of information needed by the user
- Smaller number of queries, each of which may access large amounts of data
- Users need to understand the structure of the data (and business rules) to draw meaningful conclusions from the data
- Subject orientation

Database
Broadly classified into

1. OLTP (Online Transactional Processing) DB
2. OLAP (Online Analytical Processing) DB

OLAP
- Slicing and dicing of data is called Online Analytical Processing (OLAP).
- OLAP serves the needs of data warehousing better than OLTP does.
- OLAP systems allow ad hoc processing and support access to data over time periods.
- OLAP systems hold the aggregated, transformed, integrated and historical collection of OLTP data from one or more systems.

Typical OLAP operations:

1. Roll up (drill up)
- summarize data by climbing a hierarchy or by dimension reduction.

2. Drill down (roll down)
- go from a higher-level summary to a lower-level summary or detailed data, or
- introduce new dimensions.

3. Slice and dice
- project and select.

4. Pivot (rotate)
- reorient the cube for visualization, e.g. turn a 3D cube into a series of 2D planes.
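The four operations can be illustrated on a tiny in-memory cube. This is only a sketch: the cube contents, the dimension names and the helper functions (roll_up, slice_, dice, pivot) are all made up for illustration and do not come from any OLAP product.

```python
from collections import defaultdict

# Hypothetical mini cube: each cell is keyed by (year, quarter, region, product).
cube = {
    (2004, "Q1", "North", "Product1"): 100,
    (2004, "Q2", "North", "Product1"): 120,
    (2004, "Q1", "South", "Product2"): 80,
    (2005, "Q1", "North", "Product2"): 90,
    (2005, "Q1", "South", "Product1"): 110,
}
DIMS = ("year", "quarter", "region", "product")


def roll_up(cells, keep):
    """Summarize by keeping only the dimensions in `keep` (dimension reduction)."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def slice_(cells, dim, value):
    """Select the cells where one dimension has a fixed value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}


def dice(cells, **allowed):
    """Select a sub-cube: every named dimension must fall in its allowed set."""
    return {
        k: v for k, v in cells.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def pivot(table2d):
    """Rotate a 2D view by swapping its row and column dimensions."""
    return {(col, row): v for (row, col), v in table2d.items()}


by_year = roll_up(cube, ["year"])                      # roll up to the year level
by_year_quarter = roll_up(cube, ["year", "quarter"])   # drill down one level
north = slice_(cube, "region", "North")
sub_cube = dice(cube, year={2004}, region={"North", "South"})
region_by_product = roll_up(cube, ["region", "product"])
product_by_region = pivot(region_by_product)
```

Drill down is shown here simply as rolling up to a finer set of dimensions, which is one common way to think about it.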

OLAP vs OLTP

Slno | OLTP | OLAP
1. | Transaction Oriented | Decision Oriented (Reports)
2. | Complex data model (fully normalized) | Simple data model (multidimensional/de-normalized)
3. | Smaller data volume (few historical data) | Larger data volumes (collection of historical data)
4. | Many “small” queries | Fewer, but “bigger” queries
5. | Frequent updates | Frequent reads, infrequent updates (daily)
6. | Huge no. of users (clerks) | Only few users (Management Personnel)

Objective of Data Warehouse
The primary purpose of a data warehouse is to provide easy access to specially prepared data that can be used with decision support applications, such as management reporting, queries, decision support systems, and executive information systems.

Decision Support
A Decision Support System (DSS) is a system that provides managers with the information they need to make decisions. These systems have the effect of empowering employees at all levels, giving them access to business and financial information that directly impacts their productivity and quality of work.

Executive information systems
An Executive Information System (EIS) is a concise snapshot of how the company is doing today. Consider it an electronic executive briefing. An EIS allows greater flexibility in “slicing-and-dicing” data, i.e. it allows exploration of data through multiple dimensions or views.

Why Datawarehouse?

By centralizing data

1. Queries can be answered locally without accessing the original information sources. Thus, high query performance can be obtained for the complex aggregation queries needed for in-depth analysis, decision support and data mining (a way of extracting relevant data from a vast database).

2. On-line Analytical Processing (OLAP) is decoupled (separated) as much as possible from On-line Transaction Processing (OLTP). This makes information accessible to decision makers while avoiding interference of OLAP with local processing at the operational sources.

Data warehouse
- A decision support database that is maintained separately from the organization’s operational databases
- A Data Warehouse is an enterprise-wide collection of subject-oriented, integrated, time-variant, non-volatile data in support of management’s decision making process. (W. H. Inmon, 1993)

*Subject-oriented - A data warehouse focuses on high-level business entities like sales, marketing, etc.

*Integrated - Data in the warehouse is obtained from multiple sources and kept in a consistent format.

*Time-variant - Every data component in the data warehouse is associated with some point or period of time (weekly, monthly, quarterly, yearly).

*Non-volatile - The DW stores historical data. Data does not change once it gets into the warehouse; it is only loaded or refreshed.

Data from the operational systems are
- Extracted
- Cleansed
- Transformed
  1. case conversion,
  2. data trimming,
  3. concatenation,
  4. datatype conversion
- Aggregated
- Loaded into DW
- Periodically refreshed to reflect updates at the sources, and purged from the warehouse onto slower archival storage.


Use of DWH
- Ad-hoc analyses and reports
- Data mining: identification of trends
- Management Information Systems

Designing a database for a Data Warehouse

1. Define user requirements, considering different views of users from different departments.
2. Identify data integrity, synchronization and security issues/bottlenecks.
3. Identify technology, performance, availability & utilization requirements.
4. Review normalized view of relational data to identify entities.
5. Identify dimensions.
6. Create and organize hierarchies of dimensions.
7. Identify attributes of dimensions.
8. Identify fact table(s).
9. Create data repository (metadata).
10. Add calculations.

Datamart
- A datamart is a subset of a data warehouse, designed for a particular line of business, such as sales, marketing, or finance.
- In a dependent data mart, data is derived from an enterprise-wide data warehouse.
- In an independent data mart, data is collected directly from the sources.
- May be structured for specific access tools.
- The datamart is the data warehouse you really use.

Why Datamart?

1. Data warehouse projects are very expensive and time-consuming.
2. The success rate of DWH projects is very low.

To avoid a single point of failure, we identify department-wise needs and build a datamart for one department. If that succeeds, we go on to other departments and integrate all the datamarts into a data warehouse.

Advantages
- Improve data access performance
- Simplify end-user data structures
- Facilitate ad hoc reporting

Slno | Data warehouse | Data mart
1. | DW operates on an enterprise level and contains all data used for reporting and analysis | Data Mart is used by a specific business department and is focused on a specific subject (business area). DM is a subset of DWH


DWH ARCHITECTURE
Data warehouse architecture is a way of representing the overall structure of data, communication, processing and presentation that is planned for end-user computing within the enterprise. The architecture has the following main parts:
- Operational database
- Information access layer
- Data access layer
- Data dictionary (metadata) layer
- Process management layer
- Application messaging layer
- Processing (data warehouse) layer
- Data staging layer

Operational data is the information related to day-to-day functioning of an organization. An operational database stores business transactions critical to the functioning of the organization.

Information access layer is the layer that the end-user deals with directly. Examples of these are ad-hoc query tools like Business Objects, Power Play and Impromptu.

Data access layer is the data interchange layer. This layer provides the interface between the operational databases and the information access layer. The common data language used is ‘SQL’. A familiar example of a data access layer is ‘ODBC’.

Metadata layer holds a repository of Metadata information. Metadata is defined as data about data, resulting in an intelligent, efficient way to manage data. Metadata provides the structure and content of the data warehouse, source and mapping information, transformation / integration description and business rules. It is essential for quality improvement in a Data Warehouse.


Process management layer is involved in scheduling the various tasks that must be executed to build and maintain the data warehouse and data repository. It also helps to keep the Data Warehouse up-to-date.

Application messaging layer transports information around the enterprise’s computing network. It also acts as ‘middleware’ and isolates applications from the exact data format on either end.

Processing (data warehouse) layer is the logical view of the informational data. It also performs the summarization, loading and processing of data from operational databases.

Data staging layer manages data replication across servers. It also manages data transformation.

ETL

1. ETL means Extraction, Transformation, and Loading.
2. ETL refers to the methods involved in accessing and manipulating source data and loading it into a target database.

ETL Process
ETL is a process that involves the following tasks:
- extracting data from source operational or archive systems, which are the primary source of data for the data warehouse
- transforming the data, which may involve cleaning, filtering, validating and applying business rules
- loading the data into a data warehouse or any other database or application that houses data

Transform
1. Denormalize data
2. Data cleaning
3. Case conversion
4. Data trimming
5. String concatenation
6. Datatype conversion
7. Decoding
8. Calculation
9. Data correction
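A few of these transformations can be sketched on a single record. The field names, the gender decode table and the transform function below are hypothetical and chosen only to show each step; a real ETL tool would express the same rules in its own mapping language.

```python
from decimal import Decimal

# Hypothetical decode table: the source stores codes, the warehouse stores labels.
GENDER_CODES = {"M": "Male", "F": "Female"}


def transform(row):
    """Apply several of the listed transformations to one source record."""
    first = row["first_name"].strip()            # data trimming
    last = row["last_name"].strip()
    qty = int(row["quantity"])                   # datatype conversion
    price = Decimal(row["unit_price"])           # datatype conversion
    return {
        # string concatenation followed by case conversion
        "customer_name": f"{first} {last}".title(),
        # decoding, with a default for unknown codes
        "gender": GENDER_CODES.get(row["gender"].strip().upper(), "Unknown"),
        "quantity": qty,
        "amount": qty * price,                   # calculation
    }


source_row = {
    "first_name": "  john ",
    "last_name": "SMITH  ",
    "gender": "m",
    "quantity": "3",
    "unit_price": "19.50",
}
clean_row = transform(source_row)
```

Decimal is used for the price so that the calculated amount is not affected by binary floating-point rounding.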

Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.

Data Staging Area
1. Most complex part in the architecture.
2. A place where data is processed before entering the warehouse.
3. It involves...
- Extraction (E)
- Transformation (T)
- Load (L)
- Indexing

Popular ETL Tools

Tool Name | Company Name
Informatica | Informatica Corporation
DT/Studio | Embarcadero Technologies
DataStage | IBM
Ab Initio | Ab Initio Software Corporation
Data Junction | Pervasive Software
Oracle Warehouse Builder | Oracle Corporation
Microsoft SQL Server Integration | Microsoft
TransformOnDemand | Solonde
Transformation Manager | ETL Solutions


Dimensional Modeling
- Means storing data in fact and dimension tables.
- Here data is fully denormalized.

Dimension table
1. A dimension table gives the descriptive attributes of a business.
2. It is fully denormalized.
3. It has a primary key.
4. Data is arranged in a hierarchical manner (product to category; month to year); if so, it can be used for drill down and drill up analysis.
5. Has a small number of records.
6. Has a large number of columns.
7. Heavily indexed.
8. Dimension tables are sometimes called lookup or reference tables.

Types of Dimensions
1. Normal Dimension
2. Conformed Dimension
3. Junk Dimension
4. Degenerated Dimension
5. Role Playing Dimension

Conformed Dimension
A dimension table used by more than one fact table is called a Conformed Dimension (a dimension that is linked to multiple fact tables).

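[Figure: three fact tables (FT1, FT2, FT3) linked to dimension tables D1 to D5; D1, D2 and D3 are each linked to more than one fact table]

Adv:
1. To avoid unnecessary space
2. Reduce time
3. Drill across fact tables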

Junk Dimension
A junk dimension is an abstract dimension that reduces the number of foreign keys in the fact table. This is achieved by combining 2 or more dimensions into a single dimension.


Degenerated Dimension
Means a key value or dimension which does not have descriptive attributes, i.e. a column that is neither a foreign key nor a numerical measure and is used for grouping purposes.

Ex: Invoice Number, Ticket Number

Role Playing Dimension
Means a single physical dimension table plays different roles with the help of views.
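As an illustration of role playing, the sketch below keeps one physical date dimension and exposes it twice through views, once as an order date and once as a ship date. It uses Python's built-in sqlite3 module; all table, view and column names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One physical date dimension (names are hypothetical).
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
    INSERT INTO dim_date VALUES (1, '2005-01-10'), (2, '2005-01-14');

    -- The same table plays two roles through two views.
    CREATE VIEW dim_order_date AS
        SELECT date_key AS order_date_key, full_date AS order_date FROM dim_date;
    CREATE VIEW dim_ship_date AS
        SELECT date_key AS ship_date_key, full_date AS ship_date FROM dim_date;

    CREATE TABLE fact_orders (
        order_date_key INTEGER REFERENCES dim_date(date_key),
        ship_date_key  INTEGER REFERENCES dim_date(date_key),
        amount         REAL
    );
    INSERT INTO fact_orders VALUES (1, 2, 250.0);
""")

# Each foreign key in the fact table joins to its own role of the date dimension.
row = conn.execute("""
    SELECT o.order_date, s.ship_date, f.amount
    FROM fact_orders f
    JOIN dim_order_date o ON o.order_date_key = f.order_date_key
    JOIN dim_ship_date  s ON s.ship_date_key  = f.ship_date_key
""").fetchone()
```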


Fact Table
1. The central table in a star schema is called the FACT table.
2. A fact table typically has two types of columns:
- Numerical measures and
- Foreign keys to dimension tables.
3. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
4. Fact tables store different types of measures: additive, non additive and semi additive.
5. A fact table might contain either detail level facts or facts that have been aggregated.
6. A fact table usually contains facts with the same level of aggregation.
7. Has millions of records.

Measure Types

- Additive - Measures that can be summarized across all dimensions.
  Ex: sales
- Non Additive - Measures that cannot be summarized across any dimension.
  Ex: averages
- Semi Additive - Measures that can be summarized across some dimensions but not others.
  Ex: inventory levels
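The difference between the three measure types shows up as soon as the facts are summed. The rows below are invented sample data: units sold is treated as additive, month-end stock as semi additive, and an average as non additive.

```python
# Hypothetical facts: (month, store, units_sold, units_in_stock_at_month_end)
facts = [
    ("Jan", "Store A", 10, 50),
    ("Jan", "Store B", 20, 70),
    ("Feb", "Store A", 15, 40),
    ("Feb", "Store B", 25, 60),
]

# Additive: sales can be summed across every dimension (time and store).
total_sold = sum(sold for _, _, sold, _ in facts)

# Semi additive: stock can be summed across stores for one month ...
stock_feb = sum(stock for month, _, _, stock in facts if month == "Feb")
# ... but summing it across months counts the same goods more than once.
misleading_stock = sum(stock for _, _, _, stock in facts)

# Non additive: an average is recomputed from its additive parts,
# not obtained by adding up averages from lower levels.
avg_sold_per_row = total_sold / len(facts)
```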


Factless Fact
A fact table that contains no measures or facts is called a Factless Fact table.

Slowly Changing Dimensions

1. Dimensions that change over time are called Slowly Changing Dimensions

2. Slowly Changing Dimensions are often categorized into three types, namely Type 1, Type 2 and Type 3.

Type 1 SCD:
- Used if history is not required
- Overwrites the old values.

Product Price in 2004:
Product ID(PK) | Year | Product Name | Product Price
1 | 2004 | Product1 | $150

Product Price in 2005:
Product ID(PK) | Year | Product Name | Product Price
1 | 2005 | Product1 | $250

Type 2 SCD:
- Used if both history and the current value are needed
- Creates an additional record (a new record with the new changes and a new surrogate key)
- Mostly preferred in dimensional modeling

Product
Product ID(PK) | Effective DateTime(PK) | Year | Product Name | Product Price | Expiry DateTime
1 | 01-01-2004 12.00AM | 2004 | Product1 | $150 | 12-31-2004 11.59PM
1 | 01-01-2005 12.00AM | 2005 | Product1 | $250 |

Type 3 SCD:
- Used if changes are very rare
- One previous level of history is available
- Creates new fields.

Product Price in 2005
Product ID(PK) | Current Year | Product Name | Current Product Price | Old Product Price | Old Year
1 | 2005 | Product1 | $250 | $150 | 2004
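The three SCD types can be compared by applying the same price change (Product1 going from $150 in 2004 to $250 in 2005) in three ways. This is a simplified sketch: the row layout and the function names scd_type1, scd_type2 and scd_type3 are assumptions, and a real implementation would work against dimension tables rather than Python lists.

```python
from copy import deepcopy

# Hypothetical starting state of the Product dimension (one row, 2004 price).
base = [{"product_id": 1, "year": 2004, "name": "Product1", "price": 150}]


def scd_type1(rows, product_id, year, price):
    """Type 1: overwrite in place, so no history survives."""
    rows = deepcopy(rows)
    for r in rows:
        if r["product_id"] == product_id:
            r["year"], r["price"] = year, price
    return rows


def scd_type2(rows, product_id, year, price, effective):
    """Type 2: expire the current row and append a new row with a new surrogate key."""
    rows = deepcopy(rows)
    for i, r in enumerate(rows, start=1):
        r.setdefault("surrogate_key", i)
        r.setdefault("effective", None)
        r.setdefault("expiry", None)
    current = next(r for r in rows
                   if r["product_id"] == product_id and r["expiry"] is None)
    current["expiry"] = effective
    rows.append({**current, "surrogate_key": len(rows) + 1, "year": year,
                 "price": price, "effective": effective, "expiry": None})
    return rows


def scd_type3(rows, product_id, year, price):
    """Type 3: keep exactly one previous value in extra columns."""
    rows = deepcopy(rows)
    for r in rows:
        if r["product_id"] == product_id:
            r["old_year"], r["old_price"] = r["year"], r["price"]
            r["year"], r["price"] = year, price
    return rows


t1 = scd_type1(base, 1, 2005, 250)
t2 = scd_type2(base, 1, 2005, 250, effective="2005-01-01")
t3 = scd_type3(base, 1, 2005, 250)
```

After the change, t1 holds one row with only the new price, t2 holds two rows (the expired 2004 row and the current 2005 row), and t3 holds one row carrying both the current and the previous price.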

Surrogate keys
- Surrogate keys are always numeric and unique on a table level, which makes it easy to distinguish and track values changed over time.
- Surrogate keys are integers that are assigned sequentially as needed to populate a dimension.
- Surrogate keys merely serve to join dimension tables to the fact table.
- Surrogate keys are beneficial for the following reasons:
1. They reduce the space used by the fact table.
2. Faster retrieval of data (since alphanumeric retrieval is costlier than numeric retrieval).
3. Maintaining an index is easier with a numeric key.
4. They make it possible to maintain all types of slowly changing dimension.
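Sequential assignment of surrogate keys can be sketched with a small lookup that maps each natural (business) key to the next free integer. The class name SurrogateKeyGenerator and the product codes are hypothetical; in a database this is usually handled by a sequence or identity column.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hands out sequential integer keys, one per distinct natural key."""

    def __init__(self):
        self._next = count(start=1)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Reuse the key if this natural key was seen before,
        # otherwise assign the next integer in sequence.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next)
        return self._by_natural_key[natural_key]


keys = SurrogateKeyGenerator()
# Hypothetical alphanumeric product codes coming from a source system.
source_sales = [("PRD-00A17", 150), ("PRD-00B42", 90), ("PRD-00A17", 250)]

# The fact rows carry only the compact integer key, not the long source code.
fact_rows = [(keys.key_for(code), amount) for code, amount in source_sales]
```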

Data warehouse Design
The data warehouse design essentially consists of four steps, which are as follows:
1. Identifying facts and dimensions
2. Designing fact tables
3. Designing dimension tables
4. Designing database schemas

Types of database schemas
There are three main types of database schemas:

1. Star Schema,
2. Snowflake Schema and
3. Starflake schema.

Star Schema

1. It is the simplest form of data warehouse schema and contains one or more dimension tables and fact tables.
2. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star, where one fact table is connected to multiple dimensions.
3. The center of the star schema consists of a large fact table, which points towards the dimension tables.
4. Fact Table = Highly normalized
   Dimension Table = Highly denormalized.
5. It can be very effective to treat fact data as primarily read-only data, and dimensional data as data that will change over a period of time.

Advantages:
- Star schema is easy to define.
- It reduces the number of physical joins.
- Provides very simple metadata.

Drawbacks:
- Summary data in fact tables (such as sales amount by region, district or year) yields poor performance for summary levels and huge dimension tables.

Steps in designing Star Schema

1. Identify a business process for analysis (like sales).
2. Identify measures or facts (sales dollar).
3. Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
4. List the columns that describe each dimension (region name, branch name, employee name).
5. Determine the lowest level of summary in a fact table (sales dollar).
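Following these steps for a sales process might give a schema like the one below. It is a minimal sketch using Python's built-in sqlite3 module; the table names, column names and sample rows are illustrative assumptions, and only three of the dimensions mentioned above are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions (steps 3 and 4); every name here is an illustrative assumption.
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, branch_name TEXT, region_name TEXT);
    CREATE TABLE dim_time     (time_key     INTEGER PRIMARY KEY, month TEXT, year INTEGER);

    -- Fact table (steps 2 and 5): one row per product, branch and month.
    CREATE TABLE fact_sales (
        product_key   INTEGER REFERENCES dim_product(product_key),
        location_key  INTEGER REFERENCES dim_location(location_key),
        time_key      INTEGER REFERENCES dim_time(time_key),
        sales_dollars REAL,
        PRIMARY KEY (product_key, location_key, time_key)
    );

    INSERT INTO dim_product  VALUES (1, 'Product1', 'Category1'), (2, 'Product2', 'Category1');
    INSERT INTO dim_location VALUES (1, 'Branch1', 'North'), (2, 'Branch2', 'South');
    INSERT INTO dim_time     VALUES (1, 'Jan', 2005), (2, 'Feb', 2005);
    INSERT INTO fact_sales   VALUES (1, 1, 1, 100.0), (1, 2, 1, 50.0), (2, 1, 2, 75.0);
""")

# A typical star join: one hop from the fact table to each dimension it needs.
sales_by_region = dict(conn.execute("""
    SELECT l.region_name, SUM(f.sales_dollars)
    FROM fact_sales f
    JOIN dim_location l ON l.location_key = f.location_key
    GROUP BY l.region_name
""").fetchall())
```

The composite primary key on the fact table mirrors the earlier point that a fact table's key is usually made up of its foreign keys.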


Fact constellation:
Dimension tables will, in turn, have their own dimension tables. In this case, the Store dimension will contain District ids and Region ids, which will reference the district and region dimensions of the Store dimension, respectively. This schema is called Fact Constellation Schema.

Snowflake schema

1. A snowflake schema is a star schema structure normalized through the use of outrigger tables, i.e. dimension table hierarchies are broken into simpler tables.
2. It represents the dimensional hierarchy directly by normalizing the dimension tables, i.e. all dimensional information is stored in third normal form.
3. This implies dividing the dimension tables into more tables, thus avoiding non-key attributes that depend on each other.

Advantages:
- Snowflake schema provides best performance when queries involve aggregation.

Disadvantages:
- Maintenance is complicated.
- Increase in the number of tables.
- More joins will be needed.


Starflake Schema

1. A combination of denormalized star and normalized snowflake schemas.

Star Schema vs Snowflake Schema

Slno | Star Schema | Snowflake Schema
1. | Dimension table will not have any parent table | Dimension table will have one or more parent tables
2. | Hierarchies for the dimensions are stored in the dimension table itself | Hierarchies are broken into separate tables in snowflake schema

Granularity
Means the level of detail of the data to be stored in the fact table.

Types of Granularity
1. Transactional Level Granularity
2. Periodic Snapshot Granularity

Transactional Level Granularity
- Mostly used
- Each and every transaction is stored in the fact table
- Drill down and drill up analysis can be done
- Disadvantage
  1. Size increases.

Periodic Snapshot Granularity
- Data summarized over a period is stored in the fact table
- Adv: Faster retrieval (fewer records)
- Disadv: Detailed information not available
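The trade-off between the two grains can be seen by deriving a monthly snapshot from transaction-level rows. The sample transactions below are invented for illustration.

```python
from collections import defaultdict

# Transactional grain: one fact row per sale (hypothetical data).
transactions = [
    {"date": "2005-01-03", "product": "Product1", "amount": 150},
    {"date": "2005-01-17", "product": "Product1", "amount": 100},
    {"date": "2005-01-20", "product": "Product2", "amount": 80},
    {"date": "2005-02-02", "product": "Product1", "amount": 250},
]

# Periodic snapshot grain: one fact row per (month, product).
monthly = defaultdict(int)
for t in transactions:
    month = t["date"][:7]                      # 'YYYY-MM'
    monthly[(month, t["product"])] += t["amount"]
snapshot = dict(monthly)
```

The snapshot has fewer rows than the transaction list, but the individual sales inside a month can no longer be recovered from it.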

FAQ

Hierarchy
1. Hierarchies are logical structures that use ordered levels as a means of organizing data.
2. A hierarchy can be used to define data aggregation.

Example: country > state > city > zip. In a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, and from the Quarter level to the Year level.

Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.

Operational Data Store
- In recent times, OLAP functionality is being built into OLTP systems; the result is called an ODS (operational data store).
- A physical set of tables sitting between the operational systems and the data warehouse, or a specially administered hot partition of the data warehouse itself.
- The main purpose of an ODS is to provide immediate reporting of operational results if neither the operational system nor the regular data warehouse can provide satisfactory access.
- Since an ODS is necessarily an extract of the operational data, it may also play the role of a source for the data warehouse.

Data Staging Area
1. A storage area that cleans, transforms, combines, de-duplicates and prepares source data for use in the data warehouse.
2. The data staging area is everything in between the source system and the data presentation server.
3. No querying should be done in the data staging area, because it normally is not set up to handle fine-grained security, indexing or aggregation for performance.

Data Warehouse Bus Matrix
1. The matrix helps prioritize which dimensions should be tackled first for conformity, given their prominent roles.
2. The matrix allows us to communicate effectively within and across data mart teams.
3. The columns of the matrix represent the common dimensions.
4. The rows identify the organization’s business processes.

Degenerated Dimension
Operational control numbers such as invoice numbers, order numbers and bill of lading numbers look like dimension keys in a fact table but do not join to any actual dimension table. They give rise to an empty dimension, hence we refer to them as Degenerated Dimensions (DD).
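A degenerated dimension is typically used for grouping fact rows. In the sketch below the invoice number lives directly in the fact rows and is used to total each invoice; the rows and column order are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (invoice_number, product_key, amount).
# invoice_number sits in the fact table but has no dimension table to join to.
fact_sales = [
    ("INV-1001", 1, 150),
    ("INV-1001", 2, 80),
    ("INV-1002", 1, 250),
]

totals = defaultdict(int)
for invoice_number, _product_key, amount in fact_sales:
    totals[invoice_number] += amount   # grouping by the degenerated dimension
invoice_totals = dict(totals)
```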