Modeling & ETL Design (Transcript)
1
Data Warehousing
For the Participants of IBM Bangalore
Prepared by Christopher Richard
Data Warehousing System Architect [Microsoft Certified Trainer]
2
OBJECTIVES
This training is for you, the designers, managers, and owners of the data warehouse.
This training is a field guide, a set of tools, for designing, developing, and deploying data warehouses. It is concrete and actionable.
The training describes a coherent framework that goes all the way from the original scoping of an overall data warehouse through all the detailed steps of developing and deploying it.
Along the way, I hope to give you the perspective and judgment I have accumulated in doing several data warehouse installations and consulting assignments since 1996.
3
OBJECTIVES
Achieve your goals of building a data warehouse more quickly.
Build effective data warehouses that match well against those goals.
Make fewer mistakes along the way: you will not reinvent the wheel or rediscover previously owned truths.
Gain structure and discipline to help in building a large and complex data warehouse.
4
Evolution of Data Warehousing
How Did We Get Here?
5
The progression
First data warehouse in 1905 by the DuPont Corp.
First data cube: by sales, branch, and date.
1970s: Management Decision Systems developed a product called Express (Oracle).
1983: Metaphor, founded by Ralph Kimball and two partners as a standalone DSS. Lesson learned: manage information as a corporate resource.
1980: E. F. Codd and the promise of relational databases ("data every which way").
1993: Inmon popularises the term.
6
Evolution through the 90s
Reporting, summarization, EIS applications, OLAP, data mining, intelligent agents, active warehouses.
7
Data Warehousing Industry
8
Data Warehousing Industry
9
Introduction
The data warehouse marketplace has moved beyond its infancy.
A data warehouse is continuously evolving and dynamic; it cannot be static. Take a complete lifecycle perspective.
At the very least, a data warehouse needs to evolve as fast as the surrounding organization evolves.
Adjust our expectations and our techniques away from the original idealistic, static view.
10
Introduction
We need design techniques that are flexible and adaptable.
We need to be half DBA and half MBA.
We need our changes to the data warehouse to always be graceful.
There are a number of security topics you simply have to understand if you are going to perform your job responsibly.
Welcome to Data Warehousing!
11
MESSAGE:
Information requirements are increasing geometrically.
A goodly chunk of them will have to be met, so build a data warehouse.
BUT, before you build a data warehouse, INFORM YOURSELF. If you don't, the DW consultants will steal you blind.
12
TO INFORM YOURSELF:
READ: The Data Warehouse Toolkit
READ: The Data Warehouse Lifecycle Toolkit
JOIN: this data warehouse training program
ATTEND: one implementation conference
WATCH: every presentation on data warehousing you can
SUBSCRIBE to these listservs:
DW-List: http://www.datawarehousing.com/list.asp
EduCause: http://www.educause.edu/memdir/cg/cg.html
13
The Goals of a Data Warehouse
The most important assets of an organization are almost always kept in two forms:
The operational systems of record
The data warehouse
Ultimately, we need to put aside the details of implementation and modeling and remember what the fundamental goals of the data warehouse are. It:
Makes an organization's information accessible
Makes the organization's information consistent
Is an adaptive and resilient source of information
Is a secure bastion that protects our information asset
Is the foundation for decision making
Is accepted and used by the end users
14
The Chess Pieces
Source System: an operational system of record whose function is to capture the transactions of the business. The main properties of a source system are uptime and availability.
Data Staging Area: a storage area and set of processes that clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse. It provides no user query services.
15
The Chess Pieces
Presentation Server: the target physical machine on which the data warehouse data is organized and stored for direct querying by end users, report writers, and other applications.
Dimensional Model: a specific discipline for modeling data that is an alternative to entity-relationship (E/R) modeling.
Business Process: a coherent set of business activities that make sense to the business users of our data warehouses.
16
The Chess Pieces
ROLAP (Relational OLAP): a storage option or set of user interfaces and applications that give a relational database a dimensional flavor.
MOLAP (Multidimensional OLAP): a storage option or set of user interfaces, applications, and proprietary database technology that have a strongly dimensional flavor.
HOLAP (Hybrid OLAP): a storage option of both relational and proprietary structure.
17
The Chess Pieces
Data Mart: a logical subset of the complete data warehouse.
Data Warehouse: the queryable source of data in the enterprise.
OLAP (On-line Analytic Processing): the general activity of querying and presenting text and number data from data warehouses, as well as a specifically dimensional style of querying and presenting that is exemplified by a number of OLAP vendors.
18
The Chess Pieces
End User Application: a collection of tools that query, analyze, and present information targeted to support a business need.
End User Data Access Tool: a client of the data warehouse.
Ad Hoc Query Tool: a specific kind of end user data access tool that invites the user to form their own queries by directly manipulating relational tables and their joins.
19
View the data, create reports, ad hoc, fine tuning, all done... NOT!!
20
The Chess Pieces
Modeling Applications: a sophisticated kind of data warehouse client with analytic capabilities that transform or digest the output from the data warehouse. Modeling applications include:
Forecasting models
Behavior scoring models
Allocation models
Data mining tools
Metadata: all the information in the data warehouse environment that is not the actual data itself.
21
DWH Architecture (diagram)
Information sources (operational DBs and external sources) feed tools for extraction, cleaning, loading, integration, etc., which populate the data warehouse and its data marts. OLAP servers sit on top and support the client tools: OLAP tools for queries/reports, analysis, and data mining.
22
Two Different Worlds
OLTP is profoundly different from dimensional data warehousing. Design techniques and design instincts appropriate for transaction processing are inappropriate and even destructive for data warehousing.
Consistency
OLTP consistency is microscopic: all we care about is that every transaction presented to the system has been accounted for.
The data warehouse has a quality assurance perspective: we care enormously that the current load of data is a full and consistent set of data.
23
Two Different Worlds
Transactions
An OLTP system processes thousands or even millions of transactions; the DW will process only one transaction per day, which we call the production data load.
Users and Managers
OLTP system users turn the wheels of an organization. They almost always deal with one account at a time and perform the same task many, many times. Performance is the absolute king of the OLTP system. Reporting is the primary activity of the data warehouse.
24
Two Different Worlds
One Machine or Two
The resource argument is usually sufficient reason to require a second machine. The data warehouse is often a centralized resource where data is integrated from multiple remote OLTP systems; data must be copied and restructured into the DW.
The Time Dimension
The OLTP database is a twinkling database. This is the first temporal inconsistency that we avoid in a data warehouse. It is a major burden on the OLTP system to correctly depict old history.
25
Two Different Worlds
The Entity-Relationship Data Model
The E/R model is a miracle: it drives out redundancy. The closest analogy is to a map of Los Angeles. The E/R model is very symmetric, with a huge number of connection paths between tables; its value is in using the tables individually and in pairs.
E/R models are a disaster for querying because they cannot be understood by users and cannot be navigated usefully by DBMS software. An E/R model cannot be used as the basis for an enterprise DW.
26
Typical ERDs: a small subset of tables of an existing system
27
Northwind Database Model, Relational Format (ERD)
Tables and keys: Categories (PK CategoryID), Territories (PK TerritoryID, FK RegionID), Products (PK ProductID, FKs SupplierID and CategoryID), CustomerCustomerDemo (PK/FKs CustomerID, CustomerTypeID), CustomerDemographics (PK CustomerTypeID), EmployeeTerritories (FKs TerritoryID, EmployeeID), Customers (PK CustomerID), Region (PK RegionID), Order Details (PK/FKs OrderID, ProductID), Shippers (PK ShipperID), Orders (PK OrderID, FKs CustomerID, EmployeeID, ShipVia), Suppliers (PK SupplierID), and Employees (PK EmployeeID, FK ReportsTo), each carrying its descriptive attributes (names, addresses, dates, prices, quantities, and so on).
28
The Dimensional Model
A simple data cube structure that matches end users' needs for simplicity.
The dimensional model is very asymmetric: one large dominant table sits in the center of the schema, and it is the only table with multiple joins. The center table is called the fact table; the other tables are called the dimension tables.
29
Components of a Star Schema
30
Star Schema Example
31
Northwind Database Star Schema: Orders (diagram)
Fact table fctOrders (PK OrderKey) carries foreign keys ProductKey, EmployeeKey, CustomerKey, ShipperKey, OrderDateKey, RequiredDateKey, and ShippedDateKey, plus OrderID and the shipping attributes (ShipVia, Freight, ShipName, ShipAddress, ShipCity, ShipRegion, ShipPostalCode, ShipCountry).
Dimension tables: dimCustomers (CustomerKey plus customer and demographic attributes), dimShippers (ShipperKey, ShipperID, CompanyName, Phone), dimEmployees (EmployeeKey plus employee, territory, and region attributes), dimOrderDetails (ProductKey plus order line, product, category, and supplier attributes), and dimDate (DateKey plus day, week, month, quarter, and year attributes).
32
Dimensions in Data Analysis
In the world of data warehousing, a summarizable numerical value that you use to monitor your business is called a FACT.
When looking for numeric information, your first question will be: what fact do you want to see? You could look at, let's say, sales units, sales dollars, defects, etc.
Suppose that you ask to see a report of your company's Units Sold. Here's what you get:
113
33
Dimensions in Data Analysis
Looking at one value doesn't tell you much. You want to break it into something more informative; for example, how has your company done over time? You ask for a monthly report on Units Sold. Here's the new report:
January: 14   February: 41   March: 33   April: 25
34
Dimensions in Data Analysis
You're still not satisfied with the monthly report. Your company sells more than one product; how did each of those products do over time? You ask for a new report on Units Sold by product and time. Here's the new report: a grid of Units Sold with rows for Salt Bread, Sweet Bread, and Muffins and columns for January through April.
35
Dimensions in Data Analysis
Suppose your company sells in two different states and you would like to know how each product is doing each month in each state. You ask for a new report on Units Sold by product, by time, and by state. Here's the new report: two grids of Units Sold, one for KA and one for TN, each with rows for Salt Bread, Sweet Bread, and Muffins and columns for January through April.
36
Dimensions in Data Analysis
Whichever way you lay out your report, it has three independent lists of labels.
The total number of potential values in the report equals the number of unique items in the first independent list of labels (2 states) times the number of unique items in the second (3 products) times the number of unique items in the third (4 months): 2 x 3 x 4 = 24.
In place of "independent list of labels," data warehouse designers borrow the term dimension from mathematics.
37
Dimensions in Data Analysis
Thus our report has three dimensions: TIME, STATE, and PRODUCT.
The items in a dimension are called members of that dimension.
38
Hierarchies in Data Analysis
Grouping and aggregating is the way that humans deal with numerous items.
Once your company has sold items for over a year, you would like to look at reports by year, quarter, and month.
But how do aggregations such as quarters fit into a dimension? Generally you think of members in a dimension as belonging together.
39
Hierarchies in Data Analysis
Do months and quarters belong together? Months and quarters form a hierarchy within the Time dimension, and each degree of summarization is referred to as a level. The members at the lowest level of detail are called leaf members.
There are three types of hierarchies that you may encounter:
Balanced hierarchies
Unbalanced hierarchies
Ragged hierarchies
40
Balanced Hierarchies (example)
1998: Qtr1 (Jan, Feb, Mar), Qtr2 (Apr, May, Jun), Qtr3 (Jul, Aug, Sep), Qtr4 (Oct, Nov, Dec)
41
Unbalanced Hierarchies (example: an organization chart)
Levels top to bottom: Sheri; Darren, Maya; Rebecca, Walter; Brenda, Jonathan.
42
Ragged Hierarchies (example: geography)
North America splits into USA, Canada, and Mexico; USA into the North West region (California, Oregon, Washington); Canada directly into British Columbia; Mexico into Distrito Federal and Zacatecas.
43
Fact Table
A fact table is a table in the relational data warehouse that stores the detailed values for measures, or facts.
Example: a fact table that stores Dollars and Units by state, by product, and by month has five columns. The first three columns are key columns; the remaining two are measure values.
State | Product | Month | Units | Dollars
44
Fact Table
Each column in the fact table should be either a key or a measure.
The fact table must contain a column for each measure.
The fact table must contain rows at the lowest level of detail you might want to retrieve for a measure.
A fact table almost always uses an integer key for each member rather than a descriptive name.
The key column for a date dimension might be either an integer key or a date.
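As an illustration, here is a minimal sketch (hypothetical table and column names) of the five-column fact table described above, with three integer key columns and two measures.

# Sketch of the five-column fact table: three integer keys, two measures.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_sales (
        state_key   INTEGER NOT NULL,   -- foreign key to the State dimension
        product_key INTEGER NOT NULL,   -- foreign key to the Product dimension
        month_key   INTEGER NOT NULL,   -- foreign key to the Month/Date dimension
        units       INTEGER NOT NULL,   -- measure
        dollars     REAL    NOT NULL,   -- measure
        PRIMARY KEY (state_key, product_key, month_key)
    )
""")
conn.execute("INSERT INTO fact_sales VALUES (1, 3, 200101, 10, 100.00)")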
45
Dimension Tables
A dimension table contains one row for each leaf-level member of the dimension. For example, a product dimension table with 3 products will have 3 rows.
In most cases a dimension table also contains a numeric key column that uniquely identifies each member.
The column containing the unique value is the primary key, and it is referenced by the foreign key in the fact table.
46
Dimension Tables
If the dimension is involved in a balanced hierarchy, it will have an additional column that gives the parent for each member. For example, if you have 3 products in a dimension table that belong to particular product subcategories, your table will look like this:
PROD_ID | Prod_Name       | SubCategory
...     | Sweet Muffins   | Muffins
...     | Coconut Muffins | Muffins
...     | Salt Bread      | Bread
47
Star Schema
When each dimension is stored in a single table, the database's organization is called a star schema design.
When a database's dimensions are stored in a chain of tables, the design is called a snowflake design.
A relational database must perform time-consuming joins each time a report executes, and a star design for a dimension requires fewer joins than a snowflake design (see the sketch below).
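A small sketch of that join difference, using pandas with hypothetical data: the star form answers the report with one join per dimension, while the snowflaked product dimension needs an extra join to its category table.

# Sketch: the same report against a star and a snowflaked product dimension.
import pandas as pd

fact = pd.DataFrame({"product_key": [1, 1, 2], "dollars": [10.0, 20.0, 5.0]})

# Star: one denormalized product dimension table.
dim_product_star = pd.DataFrame({
    "product_key": [1, 2],
    "product": ["Salt Bread", "Muffins"],
    "category": ["Bread", "Muffins"],
})
star = fact.merge(dim_product_star, on="product_key")            # 1 join

# Snowflake: product chained to a separate category table.
dim_product = pd.DataFrame({"product_key": [1, 2], "product": ["Salt Bread", "Muffins"],
                            "category_key": [10, 20]})
dim_category = pd.DataFrame({"category_key": [10, 20], "category": ["Bread", "Muffins"]})
snow = fact.merge(dim_product, on="product_key").merge(dim_category, on="category_key")  # 2 joins

print(star.groupby("category")["dollars"].sum())
print(snow.groupby("category")["dollars"].sum())                 # same answer, more joins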
48
Star vs. Snowflake Schema (diagram)
SHIPMENTS fact table (PK PRODKEY + INVOICE; FKs PERKEY, CUSTKEY, SHIPKEY; measures DOLLARS, WEIGHT) joined directly to the CUSTOMER, PERIOD, PRODUCT, and SHIPDATE dimension tables: star, good.
The PRODUCT dimension further normalized into a separate D_PROD table (PROD_CODE, PROD_NAME, POSITION, TYPE, VERSION): snowflake, BAD!!!!
49
Star Schema with Sample Data
50
Tip
Sometimes when we are designing a DW, it is unclear whether a numeric data field extracted from a production data source is a fact or an attribute. Simply ask yourself: is the numeric data field a measurement that varies every time we sample it, or is it a discretely valued description of something that is more or less constant?
51
Data Warehouse System (components)
Data connection(s) layer, ETL, query tools, analysis tools, presentation interface, quality assurance procedures, and *politics*.
52
Basic Processes - Data Warehouse
Extracting: the first step of getting data into the data warehouse.
Transformation: once data is extracted into the data staging area, there are many possible transformation steps, including cleaning the data, correcting misspellings, purging selected fields, creating surrogate keys for each dimension, building aggregates, etc.
Loading and Indexing: loading the data into the data warehouse and indexing it.
53
Consolidation of Disparate Data Sources
Excel spreadsheets, Access databases, and a plethora of other RDBMSs.
Most of your work will be in the ETL, data staging area. This will make or break your project!
54
Basic Processes - Data Warehouse
Quality Assurance Checking: quality assurance can be checked by running a comprehensive exception report over the entire set of newly loaded data.
Release/Publishing: the user community must be notified that the new data is ready.
Updating: modern data marts may well be updated, sometimes frequently, to handle changes in labels, changes in hierarchies, changes in status, and changes in corporate ownership.
55
Basic Processes - Data Warehouse
Querying: querying is a broad term that encompasses all the activities of requesting data from a data mart.
Data Feedback/Feeding in Reverse: data can also flow in the opposite direction, uphill from the traditional flow we have discussed.
Auditing: at times it is critically important to know where the data came from and what calculations were performed. For this you can create special audit records.
56
Basic Processes - Data Warehouse
Securing: every data warehouse has an exquisite dilemma: publish the data as widely as possible, to as many users as possible, with the easiest of user interfaces, while at the same time protecting the data from misuse and snoopers.
Backing Up and Recovering: since data warehouse data is a flow from the legacy systems through to the data marts and eventually onto the users' desktops, a real question arises about where to take the necessary snapshots.
57
Core Pieces
Select a reporting tool:
Must be simple yet robust for clients
Performance, server/client workload
Security, server/client layers
Select an ETL method:
Use what you know best
Ease of maintenance
58
Steps in the Design Process
It is good to approach the design of a data warehouse in a consistent way. You can achieve this by following four steps in a particular order. Remember that the perspective necessary to actually make these decisions comes from an understanding of the end user requirements and of what is in the legacy data sources available to the data warehouse:
Choose a business process to model
Choose the grain of the business process
Choose the dimensions and their attributes
Choose the measured facts
59
Database Design Methodology for Data Warehouses
The Nine-Step Methodology includes the following steps:
1. Choosing the process
2. Choosing the grain
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and the query modes
60
Step 1: Choosing The Process
The process (function) refers to the subject matter of a particular data mart.
First data mart built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.
61
ER Model of an Extended Version of DreamHome
62
ER Model of Property Sales Business Process of DreamHome
63
Step 2: Choosing The Grain
Decide what a record of the fact table is to represent.
Identify dimensions of the fact table. The grain decision for the fact table also determines the grain of each dimension table.
Also include time as a core dimension, which is always present in star schemas.
64
Grain
The level of detail at which measures are recorded; it provides meaning to a number stored in the fact table.
Fact = revenue
Dimensions = day, sales person, product
Grain = revenue per day, per sales person, per product
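One way to make the grain statement operational is to check that no two fact rows share the same combination of grain columns. A minimal sketch with hypothetical data:

# Sketch: verify that a fact table honors its declared grain
# (here: one row per day, per salesperson, per product).
import pandas as pd

grain = ["day", "salesperson", "product"]
fact_revenue = pd.DataFrame({
    "day":         ["2003-04-01", "2003-04-01", "2003-04-01"],
    "salesperson": ["Asha",       "Asha",       "Ravi"],
    "product":     ["Salt Bread", "Muffins",    "Salt Bread"],
    "revenue":     [120.0,        45.0,         80.0],
})

dupes = fact_revenue.duplicated(subset=grain)
assert not dupes.any(), "fact rows violate the declared grain"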
65
Step 3: Identifying and Conforming the Dimensions
Dimensions set the context for asking questions about the facts in the fact table.
If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other.
A dimension used in more than one data mart is referred to as being conformed.
66
Star Schemas for Property Sales and Property Advertising
67
Step 4: Choosing The Facts
The grain of the fact table determines which facts can be used in the data mart.
Facts should be numeric and additive.
Unusable facts include:
non-numeric facts,
non-additive facts,
facts at a different granularity from the other facts in the table.
68
Property Rentals with a Badly Structured Fact Table
69
Property Rentals with Fact Table Corrected
70
Step 5: Storing Pre-Calculations in the Fact Table
Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations (a small sketch follows).
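For illustration, a minimal sketch of storing a pre-calculation (an extended price, similar to the ExtendedPrice column in the Northwind star schema) in the fact row during staging; column names are hypothetical:

# Sketch: store a pre-calculation in the fact row during staging, so users
# never have to re-derive it at query time.
def add_precalculations(fact_row: dict) -> dict:
    # e.g. extended price = quantity * unit price * (1 - discount)
    fact_row["extended_price"] = round(
        fact_row["quantity"] * fact_row["unit_price"] * (1 - fact_row["discount"]), 2
    )
    return fact_row

print(add_precalculations({"quantity": 10, "unit_price": 2.5, "discount": 0.1}))
# {'quantity': 10, 'unit_price': 2.5, 'discount': 0.1, 'extended_price': 22.5}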
71
Step 6: Rounding Out The Dimension Tables
Text descriptions are added to the dimension tables.
Text descriptions should be as intuitive and understandable to the users as possible.
Usefulness of a data mart is determined by the scope and nature of the attributes of the dimension tables.
72
Step 7: Choosing The Duration Of The Database
Duration measures how far back in time the fact table goes.
Very large fact tables raise at least two very significant data warehouse design issues.
It is often difficult to source increasingly old data.
It is mandatory that the old versions of the important dimensions be used, not the most current versions. Known as the Slowly Changing Dimension problem.
73
Step 8: Tracking Slowly Changing Dimensions
Slowly changing dimension problem means that the proper description of the old dimension data must be used with old fact data.
Often, a generalized key must be assigned to important dimensions in order to distinguish multiple snapshots of dimensions over a period of time.
74
Step 8: Tracking Slowly Changing Dimensions
Three basic types of slowly changing dimensions:
Type 1, where a changed dimension attribute is overwritten.
Type 2, where a changed dimension attribute causes a new dimension record to be created.
Type 3, where a changed dimension attribute causes an alternate attribute to be created so that both the old and new values of the attribute are simultaneously accessible in the same dimension record.
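A minimal sketch of the three treatments applied to a single attribute change (a customer moving from state KA to TN); the keys and column names are hypothetical:

# Sketch: Type 1, 2, and 3 handling of one changed dimension attribute.
from copy import deepcopy

row = {"customer_key": 101, "customer_id": "C-1", "state": "KA", "current": True}

# Type 1: overwrite the attribute; history is lost.
type1 = deepcopy(row)
type1["state"] = "TN"

# Type 2: expire the old row and insert a new row with a new surrogate key.
type2_old = deepcopy(row)
type2_old["current"] = False
type2_new = {"customer_key": 102, "customer_id": "C-1", "state": "TN", "current": True}

# Type 3: keep old and new values side by side in the same row.
type3 = deepcopy(row)
type3["prior_state"], type3["state"] = type3["state"], "TN"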
75
Step 9: Deciding The Query Priorities And The Query Modes
The most critical physical design issues affecting the end users' perception include:
physical sort order of the fact table on disk;
presence of pre-stored summaries or aggregations.
Additional physical design issues include administration, backup, indexing performance, and security.
76
Database Design Methodology for Data Warehouses
The methodology designs a data mart that supports the requirements of a particular business process and allows easy integration with other related data marts to form the enterprise-wide data warehouse.
A dimensional model, which contains more than one fact table sharing one or more conformed dimension tables, is referred to as a fact constellation.
77
Fact and Dimension Tables for each Business Process of DreamHome
78
Dimensional Model (Fact Constellation) for the DreamHome Data Warehouse
79
When I wish upon a Star
80
Are You Familiar?
The goals of a data warehouse
The chess pieces
Different worlds: OLTP / data warehouse
Dimensional model basics
Hierarchies in dimensions
The fact table
The star schema
The snowflake schema
Basic processes of a data warehouse
81
What Is ETL?
Extract: the process of reading data from a source database.
Transform: the process of converting extracted data to a form usable by the target database, using rules or lookup tables or by combining the data with other data.
Load: the process of writing the data into the target database.
82
What does ETL do?
Extracts data from multiple data sources
Migrates data from one DB to another
Converts DBs from one format or type to another
Transforms the data to make it accessible for business analysis
Forms data marts and data warehouses
Enables loading of multiple target databases
Performs at least three specific functions (a minimal sketch follows):
reads data from an input source;
passes the stream of information through either an ETL-engine- or code-based process to modify, enhance, or eliminate data elements based on the instructions of the job;
writes the resultant data set back out to a flat file, relational table, etc.
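A minimal sketch of those three functions, assuming hypothetical CSV file names and a trivial cleaning rule:

# Sketch: read from an input source, transform the stream, write the result.
import csv

def extract(path: str):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        row["country"] = row["country"].strip().upper()      # clean a descriptive value
        if row["country"]:                                    # eliminate unusable records
            yield row

def load(rows, path: str):
    rows = list(rows)
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("customers_source.csv")), "customers_staged.csv")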
83
What can ETL be used for?
To acquire a temporary subset of data (like a VIEW) for reports or other purposes.
A more permanent data set may be acquired for other purposes, such as the population of a data mart or data warehouse.
84
ETL System (diagram)
Operational data and outer sources (different vendors, different formats) feed the ETL engine (extract, transform, filter, load), which populates the data warehouse. Local data marts are extracted from the data warehouse to provide faster processing for the OLAP end users.
85
Technical architecture design
The design of the technical environment that enables the logical design. It is a description of the elements and services of the BI environment, and a map of how the components will fit together and communicate. Basically, a blueprint by which the team, consultants, and vendors will build the business intelligence environment.
86
The architecture conceptual model (diagram)
Source systems (PA/PM on Siemens/SMS, DEC Alpha Unix, Sybase; Critical Paths on Landacorp, Oracle, HP-Unix; Budgets on custom mainframe SAS; Cost Reports and contract monitoring on MS Windows, MS Excel) feed the acquisition services.
Acquisition Services: data staging services (extraction, transformation, load, cleansing) and data staging administration (job/process control, job/process monitoring, metadata exchange, data modeling), producing load files in the data staging area.
Organization Services: the data warehouse plus metadata services (source/target models, business definitions, audit statistics, performance statistics, ETL statistics) backed by a metadata repository, data services (bulk data loader, aggregation management, index management, audit statistics, DBA administration, security administration), and data warehouse administration (data modeling, data access tool management, database administration, data staging administration).
Consumption Services: data marts (OLAP MDB and RDBMS, e.g. program evaluation OLAP MDB, performance-based budgeting RDBMS), data access services (report library management, report distribution, report scheduling, OLAP cube refreshing, query management, aggregation management, security verification, metadata navigation), and planned services (web reporting, web OLAP, data mining).
87
Data acquisition services (diagram)
Source systems (PA/PM on Siemens/SMS, DEC Alpha Unix, Sybase; COSTS on Eclipsys/TSI, Compaq HPUX, Oracle; BUDGETS on custom mainframe SAS; PATHWAYS on Landacorp, IBM AIX, Oracle) feed the data staging services (extraction, transformation, load, cleansing) and data staging administration (job/process control, job/process monitoring, metadata exchange, data modeling), producing load files in the data staging area and exchanging metadata with the organization services.
88
Acquiring the data
Sources: PM/PA, EMR, AP/MM, GL/HR, MR, CDR, Home, Solucient, State, etc.; internal and external data.
Obstacles to integration: different data models, different data definitions, different database systems, different computer platforms, dirty data, the number of operational sources.
Approaches to acquisition:
1. Hand-code extraction, transformation, cleansing, and loading services using the data manipulation language of choice (e.g., SAS, COBOL, MS DTS, Perl); the most common approach, especially for proprietary DSS data models.
2. Buy acquisition services from an ETL software vendor and customize them to your environment.
89
ETL attributes = $$$$
Multi-threaded engines (e.g., Informatica, Cognos) or code generation (e.g., ETI, SAS, DataStage)
Number of source/target DBMSs supported
Number of computing platforms supported (1-tier, 2-tier, N-tier)
Change data capture
Breadth of transformation techniques
Metadata driven; what metadata standard?
Multiple data loading options (incremental, bulk, table management, partitioning)
90
ETL technology - horizontal marketplace (vendor landscape chart showing Carleton, Informatica, ...)
91
ETL technology predictions
The large HIS vendors will adopt generic ETL technology and customize the functionality to their application portfolios and databases.
Horizontal ETL vendors MAY develop healthcare vendor portfolios such as they do for ERP vendors, but that will depend on demand, and on whether they survive.
DBMS providers will increasingly provide powerful ETL solutions, making any third-party tool obsolete, assuming you have a homogeneous DBMS implementation.
Addressing data quality will be the hardest process and tool set to sell to healthcare organizations.
Transitioning from hard-coded interfaces to a metadata-driven data acquisition environment will follow the typical healthcare technology adoption cycle, that is, a long time.
92
Organization services (diagram)
The data warehouse, with metadata services (source/target models, business definitions, audit statistics, performance statistics, ETL statistics) backed by a metadata repository, and data services (bulk data loader, aggregation management, index management, audit statistics, DBA administration, security administration), fed from the acquisition services via load files.
93
Data modeling tools (diagram)
ERwin, Embarcadero, or DSS proprietary data models feed the data staging services (extraction, transformation, load, cleansing) and the load files via metadata exchange. Source and target data models are the center of a metadata-driven environment.
94
Issues that are key to an effective ETL tool
Scheduling and job dependencies: relies particularly on a graphical environment.
Session nesting: when developing an ETL session for a particular part of the system, nesting eliminates duplicate development.
Robust SQL support: increases speed over using code to read and write to a database.
Version management: enables quick rollback rather than manually making code changes. In many cases, the DB's version control may not work on the ETL.
95
Key Issues (Contd)
Debugging functionality: very useful for developer support.
The ETL should rely on the underlying database security.
Transformation capabilities vs. cleansing capabilities: tools are seldom very strong in both.
Metadata support: must work with the overall metadata strategy.
96
Current ETL Market Share
Total market share: $667 million
97
ETL Evaluation
Throughout the following sections, each of the vendors and their ETL products are evaluated, focusing on the primary differences between them.
Ascential Software
Formed in July 2001.
Focuses on improving, developing, and perfecting their ETL and back-end tools; has no current plans to enter the BI tool market.
The Ascential DataStage product family:
a highly scalable ETL solution;
uses end-to-end metadata management and data quality assurance functions;
can create and manage scalable, complex data integration for enterprise applications such as CRM, ERP, SCM, BI/analytics, e-business, and data warehouses.
98
99
Cognos Corporation
Founded in 1969. Prefers that all components of the enterprise data warehouse are Cognos products.
DecisionStream easily integrates with Cognos BI tools, etc., but has difficulty integrating with other vendors' products.
DecisionStream is powerful ETL software: it allows users to extract and unite data from disparate sources and deliver coordinated business intelligence across your organization. It includes advanced data merging, aggregation, and transformation capabilities that let users unite data from different sources and transform it into information using best-practice dimensional design.
100
101
Informatica PowerConnect
An extension to Informatica PowerCenter and PowerCenterRT data integration software.
Eliminates the need for customers to manually code data extraction programs for their enterprise applications.
Ensures that mission-critical operational data can be effectively used to inform key business decisions across the enterprise.
Allows companies to directly source and integrate ERP, CRM, real-time message queue, mainframe, AS/400, remote data, and metadata with other enterprise data, and deliver it to data warehouses, operational data stores, business intelligence tools, and packaged analytic applications.
102
103
Conclusion
Issues analyzed: development environments, version control, security, metadata exchange standards, cost.
The ETL tools presented by Ascential and Informatica are comparable in numerous ways; it would be best to select Informatica as an ETL vendor, as it is more mature and stable as a company.
104
The Staging Area
How to Stock Your Data Warehouse Pantry
Christopher Richard [Data Warehousing System Architect]
105
All-You-Can-Eat Buffet
Buffet (ODS, DW, DM); recipe (business/transformation rules); kitchen (ETL); ingredients from different suppliers (source systems); pantry (staging area).
Our topic is the pantry, the staging area, because it is the foundation and stepchild of data warehousing.
106
Why have a pantry?
Minimizing processing on source systems: extract only once
Data integrity: source data within our own control
Incrementals
Freedom of storage format and abstraction
Audit trail
Persistence of data
Timing flexibility
Processing power
Consistent interface for downstream processes
107
Minimizing processing on source systems
Extract only once: the staging area serves downstream systems, thus limiting the impact on the source system.
Consistent extract methodology; a central knowledge base of source system extraction expertise.
Data integrity: proper timing of the different extracts within the source system schedules. Both table-centric and document-centric extraction can be applied as necessary.
108
Table-centric vs. Document-centric Extraction (example)
Source order 1000, dated 2/1/2001, amount 100.00, with two order lines: line 1 is 10 of product A, line 2 is 20 of product B.
Table-centric extraction stages the order header and the order lines as separate tables, each row tagged with its own Restart ID.
Document-centric extraction stages the complete order document, with the header values repeated onto each of its lines, as one record set with its Restart IDs.
109
Incremental Source Extraction
Reliable change identifier: an ever-increasing number, or a timestamp
Correlated change identifier
Change log
Don't forget about deletes: hard deletes and soft deletes
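For illustration, a sketch of a timestamp-driven incremental extract that also carries soft deletes through; the table, columns, and high-water mark are hypothetical:

# Sketch: incremental extraction driven by a reliable change identifier
# (here a last-updated timestamp), with soft deletes carried through as flags.
import sqlite3

def extract_increment(conn: sqlite3.Connection, last_extracted: str):
    """Pull only rows changed since the previous run, including soft deletes."""
    return conn.execute(
        "SELECT customer_id, name, deleted_flag, updated_at "
        "FROM customers WHERE updated_at > ? ORDER BY updated_at",
        (last_extracted,),
    ).fetchall()

# The high-water mark from the previous run is kept by the staging area,
# e.g. last_extracted = '2003-04-22 23:00:00'.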
110
Incrementals Implementation
Cyclic redundancy checksum: calculate for the extracted increment; true delta identification should precede all other items.
Data manipulation language code [Insert, Update, Delete]: propagatable after reassessment.
Column change bitmap: easy identification for downstream systems (Type 2 SCD).
Restart identifier [bookmark]: an ever-increasing number, unique in the whole staging area, used to quickly identify the records not yet processed by downstream systems.
Source key identifier [1:1 with source key]: an ever-increasing number, unique for a particular source key in the whole staging area; multiples per source key are allowed to support source key re-use.
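A sketch of the CRC idea (hypothetical names): the checksum of a row's concatenated attribute values is compared with the one kept in the staging area, and only rows whose CRC changed move downstream.

# Sketch: true delta identification with a cyclic redundancy checksum.
import zlib

def row_crc(row: dict, key_column: str = "customer_id") -> int:
    attrs = "|".join(str(v) for k, v in sorted(row.items()) if k != key_column)
    return zlib.crc32(attrs.encode("utf-8"))

previous_crcs = {"C-1": 0}                       # loaded from the staging area
extracted = {"customer_id": "C-1", "name": "Acme", "city": "Bangalore"}

if row_crc(extracted) != previous_crcs.get("C-1"):
    print("changed or new row: pass downstream")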
111
Column Change Bitmap Example (diagram)
Source tables: product A (product type Shoe) changes color from Blue to Red, and its price changes from 50.00 (effective 2/1/2001) to 55.00 (effective 5/1/2001).
Staging area tables: each staged row carries a change bitmap marking which columns changed (e.g. 001, 011) and a restart ID (e.g. 24, 49).
Data mart table: the resulting row (product A, type Shoe, price 55.00, effective 5/1/2001) carries change bitmap 0011 and restart ID 24.
112
Audit Trail
Track data lineage: track data movement across tables and systems; try to tag the data as soon as it enters the stream.
Track data changes: track data changes within a table; automate data change tracking outside of coding discipline wherever possible.
113
Audit Trail - Implementation
Propagation of the identifiers to downstream processes: restart identifier, source key identifier, source system identifier.
Table-specific audit data: job run identifier, source extract date and time, create and change date, time, and user, column change bitmap.
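A sketch of tagging each staged record with these audit identifiers as soon as it enters the stream (field names follow the slide, otherwise hypothetical):

# Sketch: stamp every staged record with its audit identifiers.
from datetime import datetime, timezone

def tag_audit(record: dict, job_run_id: int, source_system_id: str, restart_id: int) -> dict:
    record.update(
        job_run_id=job_run_id,
        source_system_id=source_system_id,
        restart_id=restart_id,
        source_extract_ts=datetime.now(timezone.utc).isoformat(),
    )
    return record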
114
Key learnings from doing
True delta determination is essential for large data volumes and Type II/III slowly changing dimensions.
You will have to compromise functionality for performance.
You will have to compromise data completeness for performance.
Allow staging tables to differ in design from the source tables.
Cookie cutters do work.
115
Key learnings from doing
Use one sequencer for all surrogate keys.
Implement complete pieces of logic as early in the process stream as possible, so downstream processes can benefit from them in the most timely manner.
Set processing may lead to seeking alternative storage options.
Use a sounding board.
116
Data Staging
The data staging process is the iceberg of the data warehouse project. While an iceberg looks formidable from the ship's helm, we often don't gain a full appreciation of its magnitude until we collide with it. So many challenges are buried in the data sources and the systems they run on that this part of the process invariably takes much more time than you expect.
The concepts and approach in this training apply to both hand-coded staging systems and data staging tools.
117
Data Staging
Takes data from the operational systems and prepares it for the dimensional model in the data presentation area. It is a back-room service, not a query service.
Unfortunately many teams focus on the E and L of ETL. The E does have its challenges, but most of the heavy lifting occurs in the T.
118
Transformation
Combine data
Deal with quality issues
Identify updated data
Manage surrogate keys
Build aggregates
Handle errors
119
Getting Started
For once I will skip our primary mantra of "focus on the business requirements" and present our second-favorite aphorism: MAKE A PLAN.
Do we need to use a tool?
You need to decide early. Do not expect to recoup your investment on the first iteration, due to the learning curve. A tool will provide greater metadata integration and enhanced flexibility, reusability, and maintainability in the long run.
120
Dimensional Data Staging
Extract dimensional data from operational systems.
Cleanse attribute values:
name and address parsing
inconsistent descriptive values
missing decodes
overloaded codes with multiple meanings over time
invalid data
missing data
121
Dimensional Data Staging
Manage surrogate key assignments
Since we maintain surrogate keys in the warehouse, we must maintain a persistent master cross-reference table in the staging area for each dimension. The cross-reference table keeps track of the surrogate key assigned to an operational key at a point in time, along with the attribute profile.
We interrogate the extracted dimensional source data to determine whether it is a new dimension row, an update to an existing row, or neither. New records are identified easily because their operational source key is not yet in the master cross-reference table.
122
Master Dimension Cross-Reference table (columns)
Surrogate Dimension Key
Operational Source Key
Dimension Attributes 1-N
Dimension Row Effective Date
Dimension Row Expiration Date
Most Recent Dimension Row Indicator
Most Recent Cyclic Redundancy Checksum (CRC)
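For illustration, one row of this cross-reference table could be sketched as a small data structure (the attribute payload is shown as a single dict for brevity):

# Sketch: one row of the master dimension cross-reference table.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class DimCrossRefRow:
    surrogate_key: int
    operational_source_key: str
    attributes: dict                      # dimension attributes 1..N
    effective_date: date
    expiration_date: Optional[date] = None
    most_recent_row: bool = True
    most_recent_crc: int = 0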
123
Dimensional Data Staging
To quickly determine whether rows have changed, we rely on a cyclic redundancy checksum (CRC) algorithm. If the CRC is identical for the extracted record and the most recent row of the master cross-reference table, we ignore the extracted record. If the CRC differs, we need to study each column to determine what has changed and then how the change will be handled (Type 1 / Type 2 / Type 3).
The final step is to update the most recent surrogate key assignment table. This table consists of operational source keys and their most recently assigned surrogate keys, acting as a fast lookup.
124
Dimension table surrogate key management (flowchart)
Source extract rows are CRC-compared against the master dimension cross-reference: rows with no CRC change are ignored; new source rows are assigned surrogate keys, have their dates/indicator set, and are inserted into the cross-reference and the dimension; changed rows are handled as Type 1 or 3 updates, or otherwise get a new surrogate key and dates/indicator while the prior most-recent row is updated. Finally the most recent key assignment table is updated.
125
Dimension Data Staging
Build dimension row load images and publish the revised data. Once the dimension table reflects the most recent extract (and has been confidently quality assured), it is published to all data marts that use the dimension.
126
Fact Table Staging
Extract fact data from operational sources.
Receive updated dimensions from the dimension authorities.
Separate the fact data by granularity as required.
Transform the fact data as required.
Replace the operational source keys with surrogate keys, using the most recent surrogate key assignment table created by the dimension authority (a minimal sketch follows).
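A minimal sketch of that key substitution, using a hypothetical most-recent key assignment lookup:

# Sketch: swap operational source keys in extracted fact rows for surrogate keys.
most_recent_key = {              # operational key -> current surrogate key
    ("customer", "C-1"): 101,
    ("product",  "P-9"): 7,
}

def assign_surrogates(fact_row: dict) -> dict:
    fact_row["customer_key"] = most_recent_key[("customer", fact_row.pop("customer_id"))]
    fact_row["product_key"]  = most_recent_key[("product",  fact_row.pop("product_id"))]
    return fact_row

print(assign_surrogates({"customer_id": "C-1", "product_id": "P-9", "units": 3}))
# {'units': 3, 'customer_key': 101, 'product_key': 7}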
127
Fact Table Staging
Add additional keys for known context.
Quality assure the fact table data.
Construct or update aggregation fact tables.
Bulk load the data.
Alert the users.
128
129
Smarter Business Intelligence: outsmarting to be number 1
Informatica Corporation, April 23, 2003
130
Business Imperatives
Changing markets forcing products to evolve or innovate
Changing competitive landscape forcing strategies to change
Changing economies forcing organizations to contract and be effective
Changing financial drivers geared towards profitability
Changing market positioning to leadership, to be NUMBER 1!
All of this forces companies to think smarter than ever!
131
Business Imperatives
Smarter...
marketing campaigns
products and positioning
go-to-market strategies
financial investments
lead-to-sales generation cycle
people!
132
Business Imperatives
The Challenge:
Making people think smarter
Expensive!
Impossible!
Not worth the effort!
133
Business Imperatives
The Solution:
Business intelligence initiatives: enterprise data warehouse project, balanced scorecard systems, EIS (Executive Information System) project, management cockpit, infrastructure
A business analytics platform
134
Business Analytics Solutions Often Include Multiple Tools And Technologies
Data Integration: extract, transform, and load data into the warehouse.
Data Warehouse: organize and store transaction information.
Business Intelligence: provide end users with reports and ad hoc access to the data in the warehouse.
135
Informatica Business Analytics Suite
A modular, plug-and-play approach offers the best of buy and build.
136
Market Leaders Rely on Informatica
80%+ of the Fortune 100; 80%+ of the Dow Jones Industrial Average; global reach.
Entertainment: the 5 largest. Telecommunications: 13 of the top 14. Financial services: 12 of the top 15. Pharmaceutical: 12 of the top 13. Utilities: 15 of the top 20. Insurance: 16 of the top 21. Manufacturing: 12 of the top 16.
All 4 branches of the US Armed Forces.
137
Boosting productivity
"By visually defining mappings and transformations through an easy-to-use GUI, we have been able to significantly reduce data warehouse maintenance and support costs. In fact, we now only have one resource managing a half-terabyte data warehouse."
Grady Boggs, Data Warehouse Manager
"At Hewlett-Packard, we are always looking for innovative ways to leverage technology to improve productivity, and using Informatica we have seen an over 75 percent improvement in development productivity and time to market."
Rudy Garza, Data Architect
"We have achieved very rapid time-to-deployment with Informatica, and the resulting increase in our operational and analytic capabilities will drive increased value and savings for Deluxe. Through automated replication processes and streamlined workflow, we anticipate a $6 million annual reduction in data-maintenance costs."
Andy Field, Senior Director
138
Thrifty improves productivity by over 75%
Challenge:
Systems were difficult to maintain through a lack of updated and accurate records of how, why, and where data was transferred.
Heavy reliance on code resulted in limited transformation capabilities and little flexibility to deal with changes in business requirements.
Developing a metadata strategy promoting reuse proved to be difficult.
Solution:
A single console for design, development, testing, daily management, scheduling, and smart recovery after failed components; simple operation and evolution.
An object-oriented, user-friendly interface with over 100 built-in transformations and a robust visual debugger.
Use of wizards to visually step through error-prone and repetitive tasks.
Results:
An integrated product suite enables rapid development and time to market.
An active and automated metadata solution, promoting reuse.
ROI in under a year.
139
Delivering on the Performance Promise
"One of the main drivers behind the success of our very high-performance, highly scalable enterprise data warehouse has been the performance and scalability of PowerCenter. PowerCenter's performance gives us the confidence to scale our data warehouse into the 10-20 terabyte range in the years ahead."
Mark Cothron, Data Warehouse Architect
"Informatica's performance capabilities and scalability immediately lifted it over the competition. Using Informatica we have created a multi-terabyte data warehouse, and the analysis and action-enabling information this system provides has given us a competitive advantage that can't be matched."
Patrick Firouzian, Director
140
PepsiCo creates 3 data warehouses in excess of 1 TB
"Informatica's performance has been superb and we have seen drastic improvements with each new release. We are always looking to get information into the hands of our business users quicker and more efficiently, and using Informatica we have over 30 data integration projects, with the largest being a 7-terabyte data warehouse."
Wendy Faegre, Systems Manager
Results:
Largest data warehouse > 7 TB, easily loaded in a 3-hour batch window.
Processes over 60 GB daily and 800 GB monthly; throughput exceeds 30 GB/hour.
70% improvement in performance over code.
141
Informatica Overview
Corporate: founded 1993; Nasdaq: INFA (1999); over 800 employees worldwide.
Financials: 2000: $154 million revenue; 2001: $197 million revenue; 2002: $195 million revenue.
Partners: over 200 sales, marketing, and implementation partners, including i2, PeopleSoft, the Big 5, Siebel, SAP, and Mitsubishi.
Products: industry-leading solutions for deploying business analytics across the enterprise: data integration, data warehouses, business intelligence, analytic applications.
Customers: over 1700 worldwide; 80 of the Fortune 100 and 80% of the Dow Jones.