dwdm single ppt notes

169
DATA WAREHOUSING AND DATA MINING S. Sudarshan Krithi Ramamritham IIT Bombay [email protected] [email protected] www.jntuworld.com www.jntuworld.com www.jwjobs.net 

Upload: dinakar-prasad

Post on 29-Oct-2015

57 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 1/169

DATA WAREHOUSINGANDDATA MINING

S. Sudarshan

Krithi Ramamritham

IIT Bombay 

[email protected]

[email protected]

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

Page 2: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 2/169

2

Course Overview

The course:what and how

0. Introduction I. Data Warehousing

II. Decision Support

and OLAP III. Data Mining

IV. Looking Ahead

Demos and Labs

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

Page 3: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 3/169

3

0. Introduction

Data Warehousing,OLAP and data mining:

what and why

(now)? Relation to OLTP

A case study

demos, labs

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

Page 4: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 4/169

4

Which are ourlowest/highest margin

customers ?

Who are my customersand what products

are they buying?

Which customers

are most likely to goto the competition ? 

What impact willnew products/services

have on revenue

and margins?

What product prom-

-otions have the biggestimpact on revenue?

What is the mosteffective distribution

channel?

A producer wants to know….

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

ld b

Page 5: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 5/169

5

Data, Data everywhere yet ...

I can’t find the data I need

data is scattered over thenetwork

many versions, subtledifferences

I can’t get the data I need need an expert to get the data

I can’t understand the data Ifound

available data poorly documented

I can’t use the data I found results are unexpected

data needs to be transformedfrom one form to other

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

j ld j j b

Page 6: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 6/169

6

What is a Data Warehouse?

A single, complete andconsistent store of dataobtained from a variety

of different sourcesmade available to endusers in a what theycan understand and use

in a business context.

[Barry Devlin]

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

j t ld j j b t

Page 7: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 7/169

7

What are the users saying...

Data should be integratedacross the enterprise

Summary data has a real

value to the organization

Historical data holds thekey to understanding data

over time What-if capabilities are

required

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

j t ld j j b t

Page 8: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 8/169

8

What is Data Warehousing?

  A process of 

transforming data intoinformation andmaking it available tousers in a timelyenough manner to

make a difference

[Forrester Research, April1996]Data

Information

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www jntuworld com www jwjobs net

Page 9: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 9/169

9

Evolution

60’s: Batch reports hard to find and analyze information

inflexible and expensive, reprogram every new

request

70’s: Terminal-based DSS and EIS (executiveinformation systems) still inflexible, not integrated with desktop tools

80’s: Desktop data access and analysis tools query tools, spreadsheets, GUIs

easier to use, but only access operational databases

90’s: Data warehousing with integrated OLAP

engines and tools

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www jntuworld com www jwjobs net

Page 10: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 10/169

10

Warehouses are Very LargeDatabases

35%

30%

25%

20%

15%

10%

5%

0%5GB

5-9GB

10-19GB 50-99GB 250-499GB

20-49GB 100-249GB 500GB-1TB

Initial

Projected 2Q96

Source: META Group, Inc.

      R    e    s    p    o    n      d    e    n      t    s

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www jntuworld com www jwjobs net

Page 11: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 11/169

11

Very Large Data Bases

Terabytes -- 10^12 bytes:

Petabytes -- 10^15 bytes:

Exabytes -- 10^18 bytes:

Zettabytes -- 10^21

bytes:

Zottabytes -- 10^24bytes:

Walmart -- 24 Terabytes

Geographic Information

SystemsNational Medical Records

Weather images

Intelligence AgencyVideos

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www jntuworld com www jwjobs net

Page 12: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 12/169

12

Data Warehousing --It is a process

Technique for assembling andmanaging data from varioussources for the purpose of 

answering businessquestions. Thus makingdecisions that were notprevious possible

A decision support databasemaintained separately fromthe organization’s operational

database

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www jntuworld com www jwjobs net

Page 13: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 13/169

13

Data Warehouse

A data warehouse is a

subject-oriented

integrated time-varying

non-volatile

collection of data that is used primarily inorganizational decision making.

-- Bill Inmon, Building the Data Warehouse 1996

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www jntuworld com www jwjobs net

Page 14: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 14/169

14

Explorers, Farmers and Tourists

Explorers: Seek out the unknown andpreviously unsuspected rewards hidingin the detailed data

Farmers: Harvest informationfrom known access paths

 Tourists: Browse informationharvested by farmers

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www jntuworld com www jwjobs net

Page 15: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 15/169

15

Data Warehouse Architecture

Data Warehouse

Engine

Optimized Loader 

Extraction

Cleansing

 Analyze

Query

Metadata Repository

Relational

Databases

Legacy

Data

Purchased

Data

ERPSystems

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www jntuworld com www jwjobs net

Page 16: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 16/169

16

Data Warehouse for DecisionSupport & OLAP

Putting Information technology to help the

knowledge worker make faster and better

decisions

Which of my customers are most likely to go

to the competition?

What product promotions have the biggest

impact on revenue? How did the share price of software

companies correlate with profits over last 10

years?

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www.jntuworld.com www.jwjobs.net

Page 17: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 17/169

17

Decision Support

Used to manage and control business

Data is historical or point-in-time

Optimized for inquiry rather than update Use of the system is loosely defined and

can be ad-hoc

Used by managers and end-users to

understand the business and make

 judgements

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www.jntuworld.com www.jwjobs.net

Page 18: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 18/169

18

Data Mining works with WarehouseData

Data Warehousingprovides the Enterprisewith a memory

Data Mining providesthe Enterprise withintelligence

www.jntuworld.com

www.jntuworld.com

www.jwjobs.net

www.jntuworld.com www.jwjobs.net

Page 19: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 19/169

19

We want to know ... Given a database of 100,000 names, which persons are the

least likely to default on their credit cards?

Which types of transactions are likely to be fraudulentgiven the demographics and transactional history of aparticular customer?

If I raise the price of my product by Rs. 2, what is theeffect on my ROI?

If I offer only 2,500 airline miles as an incentive topurchase rather than 5,000, how many lost responses willresult?

If I emphasize ease-of-use of the product as opposed to itstechnical capabilities, what will be the net effect on myrevenues?

Which of my customers are likely to be the most loyal? 

Data Mining helps extract such information

www j w

www.jntuworld.com

www jwj

www.jntuworld.com www.jwjobs.net

Page 20: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 20/169

20

Application Areas

Industry Application

Finance Credit Card Analysis

Insurance Claims, Fraud Analysis Telecommunication Call record analysis

 Transport Logistics management

Consumer goods promotion analysis

Data Service providersValue added data

Utilities Power usage analysis

j

www.jntuworld.com

j j

www.jntuworld.com www.jwjobs.net

Page 21: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 21/169

21

Data Mining in Use

The US Government uses Data Mining totrack fraud

A Supermarket becomes an information

broker Basketball teams use it to track game

strategy

Cross Selling Warranty Claims Routing

Holding on to Good Customers

Weeding out Bad Customers

j

www.jntuworld.com

j j

www.jntuworld.com www.jwjobs.net

Page 22: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 22/169

22

What makes data mining possible?

Advances in the following areas aremaking data mining deployable:

data warehousing

better and more data (i.e., operational,behavioral, and demographic)

the emergence of easily deployed data

mining tools and the advent of new data mining

techniques.• -- Gartner Group

j

www.jntuworld.com

j j

www.jntuworld.com www.jwjobs.net

Page 23: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 23/169

23

Why Separate Data Warehouse?

Performance Op dbs designed & tuned for known txs & workloads.

Complex OLAP queries would degrade perf. for op txs.

Special data organization, access & implementationmethods needed for multidimensional views & queries.

Function Missing data: Decision support requires historical data, which

op dbs do not typically maintain. Data consolidation: Decision support requires consolidation

(aggregation, summarization) of data from manyheterogeneous sources: op dbs, external sources.

Data quality: Different sources typically use inconsistent data

representations, codes, and formats which have to bereconciled.

j

www.jntuworld.com

j j

www.jntuworld.com www.jwjobs.net

Page 24: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 24/169

24

What are Operational Systems?

They are OLTP systems

Run mission criticalapplications

Need to work withstringent performancerequirements for

routine tasks Used to run a

business!

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 25: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 25/169

25

RDBMS used for OLTP

Database Systems have been usedtraditionally for OLTP

clerical data processing tasks

detailed, up to date data

structured repetitive tasks

read/update a few records

isolation, recovery and integrity arecritical

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 26: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 26/169

26

Operational Systems

Run the business in real time

Based on up-to-the-second data

Optimized to handle largenumbers of simple read/writetransactions

Optimized for fast response topredefined transactions

Used by people who deal with

customers, products -- clerks,salespeople etc.

They are increasingly used bycustomers

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 27: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 27/169

27

Examples of Operational Data

Data IndustryUsage Technology Volumes

CustomerFile

All TrackCustomerDetails

Legacy application, flatfiles, main frames

Small-medium

AccountBalance

Finance Controlaccountactivities

Legacy applications,hierarchical databases,mainframe

Large

Point-of-Sale data

Retail Generatebills, managestock

ERP, Client/Server,relational databases

Very Large

CallRecord

 Telecomm-unications

Billing Legacy application,hierarchical database,mainframe

Very Large

ProductionRecord

Manufact-uring

ControlProduction

ERP,relational databases,AS/400

Medium

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 28: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 28/169

So, what’s different?

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 29: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 29/169

29

Application-Orientation vs.Subject-Orientation

Application-Orientation

Operation

alDatabase

LoansCreditCard

 Trust

Savings

Subject-Orientation

Data

Warehouse

Customer

Vendor

Product

Activity

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 30: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 30/169

30

OLTP vs. Data Warehouse

OLTP systems are tuned for knowntransactions and workloads whileworkload is not known a priori in a data

warehouse Special data organization, access methods

and implementation methods are neededto support data warehouse queries(typically multidimensional queries) e.g., average amount spent on phone calls

between 9AM-5PM in Pune during the monthof December 

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 31: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 31/169

31

OLTP vs Data Warehouse

OLTP Application

Oriented

Used to runbusiness

Detailed data

Current up to date

Isolated Data Repetitive access

Clerical User

Warehouse (DSS) Subject Oriented

Used to analyze

business Summarized and

refined

Snapshot data

Integrated Data Ad-hoc access

Knowledge User(Manager)

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 32: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 32/169

32

OLTP vs Data Warehouse

OLTP Performance Sensitive

Few Records accessed at

a time (tens)

Read/Update Access

No data redundancy

Database Size 100MB-100 GB

Data Warehouse Performance relaxed

Large volumes accessed

at a time(millions) Mostly Read (Batch

Update)

Redundancy present

Database Size

100 GB - few terabytes

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 33: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 33/169

33

OLTP vs Data Warehouse

OLTP Transaction

throughput is the

performance metric Thousands of users

Managed in entirety

Data Warehouse Query throughput is

the performance

metric Hundreds of users

Managed bysubsets

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 34: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 34/169

34

To summarize ...

OLTP Systems areused to “run” abusiness

The DataWarehouse helpsto “optimize” thebusiness

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 35: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 35/169

35

Why Now?

Data is being produced

ERP provides clean data

The computing power is available The computing power is affordable

The competitive pressures are strong

Commercial products are available

www.jntuworld.com

M h di OLAP Swww.jntuworld.com www.jwjobs.net

Page 36: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 36/169

36

Myths surrounding OLAP Serversand Data Marts

Data marts and OLAP servers are departmental

solutions supporting a handful of users

Million dollar massively parallel hardware is

needed to deliver fast time for complex queries OLAP servers require massive and unwieldy

indices

Complex OLAP queries clog the network with

data

Data warehouses must be at least 100 GB to be

effective

– Source -- Arbor Software Home Page

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 37: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 37/169

37

Wal*Mart Case Study

Founded by Sam Walton

One the largest Super Market Chains

in the US

Wal*Mart: 2000+ Retail Stores

SAM's Clubs 100+Wholesalers Stores

This case study is from Felipe Carino’s (NCRTeradata) presentation made at Stanford DatabaseSeminar

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 38: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 38/169

38

Old Retail Paradigm

Wal*Mart Inventory

Management

Merchandise AccountsPayable

Purchasing

Supplier Promotions:

National, Region, StoreLevel

Suppliers Accept Orders

Promote Products

Provide specialIncentives

Monitor and TrackThe Incentives

Bill and CollectReceivables

Estimate RetailerDemands

www.jntuworld.com

N (J t I Ti ) R t ilwww.jntuworld.com www.jwjobs.net

Page 39: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 39/169

39

New (Just-In-Time) RetailParadigm

No more deals

Shelf-Pass Through (POS Application) One Unit Price

Suppliers paid once a week on ACTUAL items sold

Wal*Mart Manager Daily Inventory Restock

Suppliers (sometimes SameDay) ship to Wal*Mart

Warehouse-Pass Through Stock some Large Items

Delivery may come from supplier

Distribution Center Supplier’s merchandise unloaded directly onto Wal*Mart

Trucks

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 40: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 40/169

40

Wal*Mart System

NCR 5100M 96Nodes;

Number of Rows:

Historical Data:

New Daily Volume:

Number of Users:

Number of Queries:

24 TB Raw Disk; 700 -1000 Pentium CPUs

> 5 Billions

65 weeks (5 Quarters)

Current Apps: 75 Million

New Apps: 100 Million +

Thousands

60,000 per week 

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 41: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 41/169

41

Course Overview

0. Introduction

I. Data Warehousing

II. Decision Supportand OLAP

III. Data Mining

IV. Looking Ahead

Demos and Labs

www.jntuworld.com

I D t W hwww.jntuworld.com www.jwjobs.net

Page 42: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 42/169

42

I. Data Warehouses:Architecture, Design & Construction

DW Architecture

Loading, refreshing

Structuring/Modeling

DWs and Data Marts

Query Processing

demos, labs

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 43: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 43/169

43

Data Warehouse Architecture

Data Warehouse

Engine

Optimized Loader 

Extraction

Cleansing

 Analyze

Query

Metadata Repository

Relational

Databases

Legacy

Data

Purchased

Data

ERPSystems

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 44: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 44/169

44

Components of the Warehouse

Data Extraction and Loading

The Warehouse

Analyze and Query -- OLAP Tools

Metadata

Data Mining tools

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 45: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 45/169

Loading the Warehouse

Cleaning the data

before it is loaded

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 46: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 46/169

46

Source Data

Typically host based, legacy

applications Customized applications, COBOL,

3GL, 4GL

Point of Contact Devices POS, ATM, Call switches

External Sources Nielsen’s, Acxiom, CMIE,

Vendors, Partners

Sequential Legacy Relational ExternalOperational/

Source Data

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 47: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 47/169

47

Data Quality - The Reality

Tempting to think creating a datawarehouse is simply extractingoperational data and entering into adata warehouse

Nothing could be farther from thetruth

Warehouse data comes fromdisparate questionable sources

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 48: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 48/169

48

Data Quality - The Reality

Legacy systems no longer documented

Outside sources with questionable quality

procedures

Production systems with no built in

integrity checks and no integration

Operational systems are usually designed to

solve a specific business problem and arerarely developed to a a corporate plan

 “And get it done quickly, we do not have time to

worry about corporate standards...” 

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 49: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 49/169

49

Data Integration Across Sources

 Trust Credit cardSavings Loans

Same datadifferent name

Different dataSame name

Data found herenowhere else

Different keyssame data

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 50: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 50/169

50

Data Transformation Example

      e       n      c       o 

        d         i      n

      g  

      u       n

        i       t

        f        i

      e         l        d 

appl A - balanceappl B - balappl C - currbalappl D - balcurr

appl A - pipeline - cmappl B - pipeline - inappl C - pipeline - feet

appl D - pipeline - yds

appl A - m,f appl B - 1,0appl C - x,yappl D - male, female

Data Warehouse

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 51: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 51/169

51

Data Integrity Problems

Same person, different spellings

Agarwal, Agrawal, Aggarwal etc...

Multiple ways to denote company name

Persistent Systems, PSPL, Persistent Pvt.LTD.

Use of different names

mumbai, bombay

Different account numbers generated by

different applications for the same customer Required fields left blank

Invalid product codes collected at point of sale

manual entry leads to mistakes

 “in case of a problem use 9999999” www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 52: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 52/169

52

Data Transformation Terms

Extracting

Conditioning

Scrubbing

Merging

Householding

Enrichment

Scoring

Loading

Validating

Delta Updating

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 53: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 53/169

53

Data Transformation Terms

Extracting

Capture of data from operational source in

 “as is” status

Sources for data generally in legacymainframes in VSAM, IMS, IDMS, DB2; more

data today in relational databases on Unix

Conditioning

The conversion of data types from the source

to the target data store (warehouse) --

always a relational database

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 54: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 54/169

54

Data Transformation Terms

Householding

Identifying all members of a household(living at the same address)

Ensures only one mail is sent to ahousehold

Can result in substantial savings: 1 lakh

catalogues at Rs. 50 each costs Rs. 50lakhs. A 2% savings would save Rs. 1lakh.

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 55: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 55/169

55

Data Transformation Terms

Enrichment Bring data from external sources to

augment/enrich operational data. Data

sources include Dunn and Bradstreet, A.C. Nielsen, CMIE, IMRA etc...

Scoring computation of a probability of an

event. e.g..., chance that a customerwill defect to AT&T from MCI, chancethat a customer is likely to buy a newproduct

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 56: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 56/169

56

Loads

After extracting, scrubbing, cleaning,validating etc. need to load the datainto the warehouse

Issues huge volumes of data to be loaded

small time window available when warehouse can betaken off line (usually nights)

when to build index and summary tables

allow system administrators to monitor, cancel, resume,change load rates

Recover gracefully -- restart after failure from whereyou were and without loss of data integrity

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 57: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 57/169

57

Load Techniques

Use SQL to append or insert newdata

record at a time interface

will lead to random disk I/O’s

Use batch load utility

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 58: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 58/169

58

Load Taxonomy

Incremental versus Full loads

Online versus Offline loads

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 59: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 59/169

59

Refresh

Propagate updates on source data tothe warehouse

Issues:

when to refresh

how to refresh -- refresh techniques

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 60: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 60/169

60

When to Refresh?

periodically (e.g., every night, every

week) or after significant events

on every update: not warranted unless

warehouse data require current data (up

to the minute stock quotes)

refresh policy set by administrator based

on user needs and traffic

possibly different policies for different

sources

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 61: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 61/169

61

Refresh Techniques

Full Extract from base tables

read entire source table: too expensive

maybe the only choice for legacy

systems

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 62: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 62/169

62

How To Detect Changes

Create a snapshot log table to recordids of updated rows of source dataand timestamp

Detect changes by:

Defining after row triggers to updatesnapshot log when source table changes

Using regular transaction log to detectchanges to source data

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 63: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 63/169

63

Data Extraction and Cleansing

Extract data from existingoperational and legacy data

Issues: Sources of data for the warehouse Data quality at the sources

Merging different data sources

Data Transformation

How to propagate updates (on the sources) tothe warehouse

Terabytes of data to be loaded

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 64: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 64/169

64

Scrubbing Data

Sophisticatedtransformation tools.

Used for cleaning the

quality of data Clean data is vital for thesuccess of the warehouse

Example

Seshadri, Sheshadri,Sesadri, Seshadri S.,Srinivasan Seshadri, etc.are the same person

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 65: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 65/169

65

Scrubbing Tools

Apertus -- Enterprise/Integrator

Vality -- IPE

Postal Soft

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 66: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 66/169

Structuring/Modeling Issues

www.jntuworld.com

Data -- Heart of the Datawww.jntuworld.com www.jwjobs.net

Page 67: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 67/169

67

Data -- Heart of the DataWarehouse

Heart of the data warehouse is thedata itself!

Single version of the truth

Corporate memory

Data is organized in a way that

represents business -- subjectorientation

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 68: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 68/169

68

Data Warehouse Structure

Subject Orientation -- customer,product, policy, account etc... Asubject may be implemented as a

set of related tables. E.g.,customer may be five tables

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 69: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 69/169

69

Data Warehouse Structure

base customer (1985-87) custid, from date, to date, name, phone, dob

base customer (1988-90) custid, from date, to date, name, credit rating,

employer

customer activity (1986-89) -- monthlysummary

customer activity detail (1987-89) custid, activity date, amount, clerk id, order no

customer activity detail (1990-91) custid, activity date, amount, line item no, order no

Time isTime is

 part of  part o f 

key of key of each tableeach table

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 70: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 70/169

70

Data Granularity in Warehouse

Summarized data stored

reduce storage costs

reduce cpu usage

increases performance since smallernumber of records to be processed

design around traditional high level

reporting needs tradeoff with volume of data to be

stored and detailed usage of data

www.jntuworld.com

l h

www.jntuworld.com www.jwjobs.net

Page 71: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 71/169

71

Granularity in Warehouse

Can not answer some questions withsummarized data

Did Anand call Seshadri last month? Not

possible to answer if total duration of calls by Anand over a month is onlymaintained and individual call detailsare not.

Detailed data too voluminous

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 72: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 72/169

72

Granularity in Warehouse

Tradeoff is to have dual level of granularity

Store summary data on disks 95% of DSS processing done against this

data

Store detail on tapes

5% of DSS processing against this data

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 73: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 73/169

73

Vertical Partitioning

Frequently

accessed Rarelyaccessed

Smaller tableand so lessI/O

Acct.No

Name BalanceDate OpenedInterest

RateAddress

Acct.No

BalanceAcct.No

Name Date OpenedInterest

RateAddress

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 74: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 74/169

74

Derived Data

Introduction of derived (calculateddata) may often help

Have seen this in the context of duallevels of granularity

Can keep auxiliary views andindexes to speed up query

processing

www.jntuworld.com

S h D i

www.jntuworld.com www.jwjobs.net

Page 75: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 75/169

75

Schema Design

Database organization must look like business

must be recognizable by business user

approachable by business user Must be simple

Schema Types

Star Schema Fact Constellation Schema

Snowflake schema

www.jntuworld.com

Di i T bl

www.jntuworld.com www.jwjobs.net

Page 76: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 76/169

76

Dimension Tables

Dimension tables Define business in terms already

familiar to users

Wide rows with lots of descriptive text Small tables (about a million rows)

Joined to fact table by a foreign key

heavily indexed

typical dimensions time periods, geographic region (markets,

cities), products, customers, salesperson,etc.

www.jntuworld.com

F t T bl

www.jntuworld.com www.jwjobs.net

Page 77: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 77/169

77

Fact Table

Central table

mostly raw numeric items

narrow rows, a few columns at most

large number of rows (millions to abillion)

Access via dimensions

www.jntuworld.com

St S h

www.jntuworld.com www.jwjobs.net

Page 78: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 78/169

78

Star Schema

A single fact table and for eachdimension one dimension table

Does not capture hierarchies directly

T i

m

e

 pr od

cust 

cit 

 y 

act 

date, custno, prodno, cityname, ...

www.jntuworld.com

S fl k h

www.jntuworld.com www.jwjobs.net

Page 79: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 79/169

79

Snowflake schema

Represent dimensional hierarchy directlyby normalizing tables.

Easy to maintain and saves storage

T i

m

e

 pr od

cust 

cit 

 y 

f act 

date, custno, prodno, cityname, ...

r egionwww.jntuworld.com

F t C t ll ti

www.jntuworld.com www.jwjobs.net

Page 80: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 80/169

80

Fact Constellation

Fact Constellation

Multiple fact tables that share manydimension tables

Booking and Checkout may share manydimension tables in the hotel industry

Hotels

Travel Agents

Promotion

Room Type

Customer 

Booking

Checkout 

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 81: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 81/169

81

De-normalization

Normalization in a data warehousemay lead to lots of small tables

Can lead to excessive I/O’s sincemany tables have to be accessed

De-normalization is the answerespecially since updates are rare

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 82: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 82/169

82

Creating Arrays

Many times each occurrence of a sequence of data is in a different physical location

Beneficial to collect all occurrences together

and store as an array in a single row Makes sense only if there are a stable

number of occurrences which are accessedtogether

In a data warehouse, such situations arisenaturally due to time based orientation can create an array by month

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 83: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 83/169

83

Selective Redundancy

Description of an item can be storedredundantly with order table --most often item description is also

accessed with order table Updates have to be careful

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 84: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 84/169

84

Partitioning

Breaking data into severalphysical units that can behandled separately

Not a question of whether  to do it in datawarehouses but how to doit

Granularity andpartitioning are key toeffective implementationof a warehouse

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 85: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 85/169

85

Why Partition?

Flexibility in managing data

Smaller physical units allow

easy restructuring

free indexing

sequential scans if needed

easy reorganization

easy recovery

easy monitoring

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 86: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 86/169

86

Criterion for Partitioning

Typically partitioned by

date

line of business

geography

organizational unit

any combination of above

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 87: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 87/169

87

Where to Partition?

Application level or DBMS level

Makes sense to partition atapplication level

Allows different definition for each year Important since warehouse spans many

years and as business evolves definitionchanges

Allows data to be moved betweenprocessing complexes easily

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 88: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 88/169

Data Warehouse vs. Data Marts

What comes first

www.jntuworld.com

From the Data Warehouse to Datawww.jntuworld.com www.jwjobs.net

Page 89: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 89/169

89

Marts

DepartmentallyStructured

IndividuallyStructured

Data WarehouseOrganizationallyStructured

Less

More

HistoryNormalizedDetailed

Data

Information

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 90: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 90/169

90

Data Warehouse and Data Marts

OLAPData MartLightly summarizedDepartmentally structured

Organizationally structuredAtomicDetailed Data Warehouse Data

www.jntuworld.com

Characteristics of thewww.jntuworld.com www.jwjobs.net

Page 91: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 91/169

91

Departmental Data Mart

OLAP

Small

Flexible

Customized byDepartment

Source is

departmentallystructured datawarehouse

www.jntuworld.com

Techniques for Creatingwww.jntuworld.com www.jwjobs.net

Page 92: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 92/169

92

Departmental Data Mart

OLAP

Subset

Summarized

Superset

Indexed Arrayed

Sales Mktg.Finance

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 93: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 93/169

93

Data Mart Centric

Data Marts

Data Sources

Data Warehouse

www.jntuworld.com

Problems with Data Mart Centricwww.jntuworld.com www.jwjobs.net

Page 94: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 94/169

94

Solution

If you end up creating multiplewarehouses, integrating them is aproblem

www.jntuworld.com

h

www.jntuworld.com www.jwjobs.net

Page 95: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 95/169

95

True Warehouse

Data Marts

Data Sources

Data Warehouse

www.jntuworld.com

P

www.jntuworld.com www.jwjobs.net

Page 96: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 96/169

96

Query Processing

Indexing

Pre computedviews/aggregates

SQL extensions

www.jntuworld.com

d T h

www.jntuworld.com www.jwjobs.net

Page 97: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 97/169

97

Indexing Techniques

Exploiting indexes to reducescanning of data is of crucialimportance

Bitmap Indexes

Join Indexes

Other Issues

Text indexing

Parallelizing and sequencing of indexbuilds and incremental updates

www.jntuworld.com

Indexing Techniques

www.jntuworld.com www.jwjobs.net

Page 98: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 98/169

98

Indexing Techniques

Bitmap index:

A collection of bitmaps -- one for eachdistinct value of the column

Each bitmap has N bits where N is thenumber of rows in the table

A bit corresponding to a value v for a

row r is set if and only if r has the valuefor the indexed attribute

www.jntuworld.com

Bi M I d

www.jntuworld.com www.jwjobs.net

Page 99: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 99/169

99

BitMap Indexes

An alternative representation of RID-list

Specially advantageous for low-cardinalitydomains

Represent each row of a table by a bitand the table as a bit vector

There is a distinct bit vector Bv for eachvalue v for the domain

Example: the attribute sex has values Mand F. A table of 100 million peopleneeds 2 lists of 100 million bits

www.jntuworld.com

Bi I d

www.jntuworld.com www.jwjobs.net

Page 100: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 100/169

100

Customer Query : select * from customer where

gender = ‘F’ and vote = ‘Y’

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

Bitmap Index

M

F

F

F

F

M

 Y

 Y

 Y

N

N

N

www.jntuworld.com

Bit M I d

www.jntuworld.com www.jwjobs.net

Page 101: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 101/169

101

Bit Map Index

Cust Region Rating

C1 N H

C2 S M

C3 W L

C4 W H

C5 S L

C6 W L

C7 N H

Base TableBase Table

Row ID N S E W

1 1 0 0 0

2 0 1 0 0

3 0 0 0 1

4 0 0 0 1

5 0 1 0 0

6 0 0 0 1

7 1 0 0 0

Row ID H M L

1 1 0 0

2 0 1 0

3 0 0 0

4 0 0 0

5 0 1 0

6 0 0 0

7 1 0 0

Rating Index Rating Index Region Index Region Index 

Customers where Customers where   Region = WRegion = W Rating = MRating = M And   And  

www.jntuworld.com

BitM I d

www.jntuworld.com www.jwjobs.net

Page 102: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 102/169

102

BitMap Indexes

Comparison, join and aggregation operationsare reduced to bit arithmetic with dramaticimprovement in processing time

Significant reduction in space and I/O (30:1) Adapted for higher cardinality domains as well.

Compression (e.g., run-length encoding)exploited

Products that support bitmaps: Model 204,TargetIndex (Redbrick), IQ (Sybase), Oracle7.3

www.jntuworld.com

Join Indexes

www.jntuworld.com www.jwjobs.net

Page 103: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 103/169

103

Join Indexes

Pre-computed joins

A join index between a fact table and adimension table correlates a dimension

tuple with the fact tuples that have thesame value on the common dimensionalattribute e.g., a join index on city dimension of calls

fact table correlates for each city the calls (in the calls 

table) from that city

www.jntuworld.com

J i I d

www.jntuworld.com www.jwjobs.net

Page 104: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 104/169

104

Join Indexes

Join indexes can also span multipledimension tables

e.g., a join index on city and time 

dimension of calls fact table

www.jntuworld.com

Star Join Processing

www.jntuworld.com www.jwjobs.net

Page 105: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 105/169

105

Star Join Processing

Use join indexes to join dimensionand fact table

Calls

C+T 

C+T+L

C+T+L+P

Time

Loca-tion

Plan

www.jntuworld.com

Optimized Star Join Processing

www.jntuworld.com www.jwjobs.net

Page 106: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 106/169

106

Optimized Star Join Processing

Time

Loca-tion

Plan

Calls

Virtual Cross Product of T, L and P

 Apply Selections

www.jntuworld.com

Bitmapped Join Processing

www.jntuworld.com www.jwjobs.net

Page 107: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 107/169

107

Bitmapped Join Processing

AND

Time

Loca-tion

Plan

Calls

Calls

Calls

Bitmaps10

1

001

110

www.jntuworld.com

Intelligent Scan

www.jntuworld.com www.jwjobs.net

Page 108: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 108/169

108

Intelligent Scan

Piggyback multiple scans of arelation (Redbrick)

piggybacking also done if second scan

starts a little while after the first scan

www.jntuworld.com

Parallel Query Processing

www.jntuworld.com www.jwjobs.net

Page 109: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 109/169

109

Parallel Query Processing

Three forms of parallelism

Independent

Pipelined

Partitioned and “partition and replicate” 

Deterrents to parallelism

startup

communication

www.jntuworld.com

Parallel Query Processing

www.jntuworld.com www.jwjobs.net

Page 110: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 110/169

110

Parallel Query Processing

Partitioned Data Parallel scans

Yields I/O parallelism

Parallel algorithms for relational operators Joins, Aggregates, Sort

Parallel Utilities Load, Archive, Update, Parse, Checkpoint,

Recovery

Parallel Query Optimization

www.jntuworld.com

Pre-computed Aggregates

www.jntuworld.com www.jwjobs.net

Page 111: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 111/169

111

Pre computed Aggregates

Keep aggregated data forefficiency (pre-computed queries)

Questions Which aggregates to compute?

How to update aggregates?

How to use pre-computed aggregatesin queries?

www.jntuworld.com

Pre computed Aggregates

www.jntuworld.com www.jwjobs.net

Page 112: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 112/169

112

Pre-computed Aggregates

Aggregated table can be maintainedby the

warehouse server

middle tier

client applications

Pre-computed aggregates -- special

case of materialized views -- samequestions and issues remain

www.jntuworld.com

SQL Extensions

www.jntuworld.com www.jwjobs.net

Page 113: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 113/169

113

SQL E t ns ons

Extended family of aggregatefunctions

rank (top 10 customers)

percentile (top 30% of customers)

median, mode

Object Relational Systems allow

addition of new aggregate functions

www.jntuworld.com

SQL Extensions

www.jntuworld.com www.jwjobs.net

Page 114: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 114/169

114

SQL Extensions

Reporting features running total, cumulative totals

Cube operator

group by on all subsets of a set of attributes (month,city)

redundant scan and sorting of data can

be avoided

www.jntuworld.com

Red Brick has Extended set ofAggregates

www.jntuworld.com www.jwjobs.net

Page 115: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 115/169

115

Aggregates

Select month, dollars, cume(dollars) asrun_dollars, weight, cume(weight) asrun_weightsfrom sales, market, product, period t

 where year = 1993and product like ‘Columbian%’and city like ‘San Fr%’order by t.perkey

www.jntuworld.com

RISQL (Red Brick Systems)Extensions

www.jntuworld.com www.jwjobs.net

Page 116: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 116/169

116

Extensions

Aggregates CUME

MOVINGAVG

MOVINGSUM RANK

TERTILE

RATIOTOREPORT

Calculating RowSubtotals BREAK BY

Sophisticated DateTime Support DATEDIFF

Using SubQueries

in calculations

www.jntuworld.com

Using SubQueries in Calculations

www.jntuworld.com www.jwjobs.net

Page 117: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 117/169

117

Using SubQueries in Calculations

select product, dollars as jun97_sales,(select sum(s1.dollars)from market mi, product pi, period, ti, sales si

 where pi.product = product.productand ti.year = period.yearand mi.city = market.city) as total97_sales,100 * dollars/

(select sum(s1.dollars)from market mi, product pi, period, ti, sales si where pi.product = product.productand ti.year = period.yearand mi.city = market.city) as percent_of_yr

from market, product, period, sales

 where year = 1997and month = ‘June’ and city like ‘Ahmed%’

order by product;

www.jntuworld.com

Course Overview

www.jntuworld.com www.jwjobs.net

Page 118: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 118/169

118

Course Overview

The course:what and how

0. Introduction

I. Data Warehousing

II. Decision Supportand OLAP

III. Data Mining

IV. Looking Ahead

Demos and Labs www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 119: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 119/169

II. On-Line Analytical Processing (OLAP)

Making Decision

Support Possible

www.jntuworld.com

Limitations of SQL

www.jntuworld.com www.jwjobs.net

Page 120: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 120/169

120

Limitations of SQL

 “A Freshman in

Business needs

a Ph.D. in SQL” 

-- Ralph Kimball

www.jntuworld.com

Typical OLAP Queries

www.jntuworld.com www.jwjobs.net

Page 121: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 121/169

121

Typical OLAP Queries

Write a multi-table join to compare sales for each

product line YTD this year vs. last year.

Repeat the above process to find the top 5

product contributors to margin.

Repeat the above process to find the sales of a

product line to new vs. existing customers.

Repeat the above process to find the customers

that have had negative sales growth.

www.jntuworld.com

What Is OLAP?

www.jntuworld.com www.jwjobs.net

Page 122: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 122/169

122

* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

Online Analytical Processing - coined byEF Codd in 1994 paper contracted byArbor Software*

Generally synonymous with earlier terms such asDecisions Support, Business Intelligence, Executive

Information System

OLAP = Multidimensional Database

MOLAP: Multidimensional OLAP (Arbor Essbase,Oracle Express)

ROLAP: Relational OLAP (Informix MetaCube,Microstrategy DSS Agent)

www.jntuworld.com

The OLAP Market

www.jntuworld.com www.jwjobs.net

Page 123: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 123/169

123

The OLAP Market

Rapid growth in the enterprise market 1995: $700 Million 1997: $2.1 Billion

Significant consolidation activity among

major DBMS vendors 10/94: Sybase acquires ExpressWay 7/95: Oracle acquires Express 11/95: Informix acquires Metacube 1/97: Arbor partners up with IBM 10/96: Microsoft acquires Panorama

Result: OLAP shifted from small verticalniche to mainstream DBMS category

www.jntuworld.com

Strengths of OLAP

www.jntuworld.com www.jwjobs.net

Page 124: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 124/169

124

Strengths of OLAP

It is a powerful visualization paradigm

It provides fast, interactive response

times It is good for analyzing time series

It can be useful to find some clusters and

outliers

Many vendors offer OLAP tools

www.jntuworld.com

OLAP Is FASMI

www.jntuworld.com www.jwjobs.net

Page 125: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 125/169

125

Nigel Pendse, Richard Creath - The OLAP ReportNigel Pendse, Richard Creath - The OLAP Report

OLAP Is FASMI

Fast

Analysis

Shared

Multidimensional

Information

www.jntuworld.com

M lti di i l D t

www.jntuworld.com www.jwjobs.net

Page 126: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 126/169

126

MonthMonth

11 22 33 44 776655

   P

  r  o   d  u  c   t

   P

  r  o   d  u  c   t

ToothpasteToothpaste

JuiceJuiceColaCola

MilkMilk

CreamCream

SoapSoap

   R  e  g 

   i  o  n

   R  e  g 

   i  o  n

WWSSNN

Dimensions:Dimensions: Product, Region, TimeProduct, Region, Time

Hierarchical summarization pathsHierarchical summarization paths

ProductProduct RegionRegion TimeTime

Industry Country Year Industry Country Year 

Category Region Quarter Category Region Quarter 

Product City Month WeekProduct City Month Week

 

Office DayOffice Day

Multi-dimensional Data

 “Hey…I sold $100M worth of goods” 

www.jntuworld.com

Data Cube Lattice

www.jntuworld.com www.jwjobs.net

Page 127: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 127/169

127

Data Cube Lattice

Cube lattice ABC

AB AC BCA B C

none Can materialize some groupbys, compute others

on demand

Question: which groupbys to materialze?

Question: what indices to create

Question: how to organize data (chunks, etc)

www.jntuworld.com

Visualizing Neighbors is simpler

www.jntuworld.com www.jwjobs.net

Page 128: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 128/169

128

Visualizing Neighbors is simpler

1 2 3 4 5 6 7 8

Apr

May

 J un

 J ulAug

Sep

Oct

Nov

Dec

 J an

Feb

Mar

Month Store SalesApr 1

Apr 2

Apr 3

Apr 4

Apr 5

Apr 6

Apr 7

Apr 8

May 1

May 2

May 3

May 4

May 5

May 6

May 7

May 8

 J un 1

 J un 2

www.jntuworld.com

A Visual Operation: Pivot (Rotate)

www.jntuworld.com www.jwjobs.net

Page 129: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 129/169

129

A Visual Operation: Pivot (Rotate)

1010

4747

3030

1212

JuiceJuice

ColaCola

Milk Milk 

CreamCream

N    Y    N    Y    L    A   L    A   

S   F    S   F    

3/1 3/2 3/3 3/43/1 3/2 3/3 3/4

DateDate

   M  o  n   t   h

   M  o  n   t   h

         R        e        g         i        o        n

         R        e        g         i        o        n

ProductProduct

www.jntuworld.com

“Slicing and Dicing”

www.jntuworld.com www.jwjobs.net

Page 130: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 130/169

130

Slicing and Dicing

Product

Sales Channel

   R  e  g   i  o

  n  s

RetailDirect Special

Household

 Telecomm

Video

Audio IndiaFar East

Europe

 The Telecomm Slice

www.jntuworld.com

Roll-up and Drill Down

www.jntuworld.com www.jwjobs.net

Page 131: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 131/169

131

Roll up and Drill Down

Sales Channel

Region

Country

State

Location Address

SalesRepresentative

   R  o   l   l    U  p

Higher Level of Aggregation

Low-levelDetails

D     r    i      l     l     -   D     

 o    w    n    

www.jntuworld.com

Nature of OLAP Analysis

www.jntuworld.com www.jwjobs.net

Page 132: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 132/169

132

Nature of OLAP Analysis

Aggregation -- (total sales,percent-to-total)

Comparison -- Budget vs.Expenses

Ranking -- Top 10, quartileanalysis

Access to detailed and

aggregate data Complex criteria

specification

Visualizationwww.jntuworld.com

Organizationally Structured Data

www.jntuworld.com www.jwjobs.net

Page 133: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 133/169

133

Organizationally Structured Data

Different Departments look at the samedetailed data in different ways. Withoutthe detailed, organizationally structureddata as a foundation, there is noreconcilability of data

marketing

manufacturing

sales

finance

www.jntuworld.com

Multidimensional Spreadsheets

www.jntuworld.com www.jwjobs.net

Page 134: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 134/169

134

m p

Analysts needspreadsheets that support pivot tables (cross-tabs)

drill-down and roll-up

slice and dice sort

selections

derived attributes

Popular in retail domain

www.jntuworld.com

OLAP - Data Cube

www.jntuworld.com www.jwjobs.net

Page 135: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 135/169

135

OLAP Data Cube

Idea: analysts need to group data in manydifferent ways

eg. Sales(region, product, prodtype, prodstyle,date, saleamount)

saleamount is a measure attribute, rest aredimension attributes

groupby every subset of the other attributes

materialize (precompute and store)

groupbys to give online response Also: hierarchies on attributes: date ->

weekday,date -> month -> quarter -> year

www.jntuworld.com

SQL Extensions

www.jntuworld.com www.jwjobs.net

Page 136: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 136/169

136

SQL Extensions

Front-end tools require Extended Family of Aggregate Functions

rank, median, mode

Reporting Features running totals, cumulative totals

Results of multiple group by total sales by month and total sales by

product Data Cube

www.jntuworld.com

Relational OLAP: 3 Tier DSS

www.jntuworld.com www.jwjobs.net

Page 137: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 137/169

137

Relational OLAP: 3 Tier DSS

Data Warehouse ROLAP Engine Decision Support Client

Database Layer Application Logic Layer Presentation Layer  

Store atomic

data inindustrystandardRDBMS.

Generate SQL

execution plansin the ROLAPengine to obtainOLAPfunctionality.

Obtain multi-

dimensionalreports fromthe DSS Client.

www.jntuworld.com

MD-OLAP: 2 Tier DSS

www.jntuworld.com www.jwjobs.net

Page 138: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 138/169

138

MD-OLAP: 2 Tier DSS

MDDB Engine MDDB Engine Decision Support Client

Database Layer Application Logic Layer Presentation Layer  

Store atomic data in aproprietary data structure(MDDB), pre-calculate as manyoutcomes as possible, obtainOLAP functionality via proprietaryalgorithms running against this

data

Obtain multi-dimensionalreports from theDSS Client.

www.jntuworld.com

Typical OLAP ProblemsD t E l i

www.jntuworld.com www.jwjobs.net

Page 139: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 139/169

139

16 81 256 10244096

16384

65536

0

10000

20000

30000

40000

50000

60000

70000

2 3 4 5 6 7 8

Data Explosion SyndromeData Explosion Syndrome

Number of DimensionsNumber of Dimensions

   N  u  m   b

  e  r  o   f   A  g  g  r  e  g  a   t   i  o  n  s

   N  u  m   b

  e  r  o   f   A  g  g  r  e  g  a   t   i  o  n  s

(4 levels in each dimension)(4 levels in each dimension)

Data Explosion

Microsoft TechEd’98www.jntuworld.com

Metadata Repository

www.jntuworld.com www.jwjobs.net

Page 140: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 140/169

140

p y

Administrative metadata source databases and their contents

gateway descriptions

warehouse schema, view & derived data definitions

dimensions, hierarchies pre-defined queries and reports

data mart locations and contents

data partitions

data extraction, cleansing, transformation rules,defaults

data refresh and purging rules

user profiles, user groups

security: user authorization, access control

www.jntuworld.com

Metdata Repository .. 2

www.jntuworld.com www.jwjobs.net

Page 141: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 141/169

141

p y

Business data business terms and definitions

ownership of data

charging policies

operational metadata data lineage: history of migrated data and

sequence of transformations applied

currency of data: active, archived, purged

monitoring information: warehouse usagestatistics, error reports, audit trails.

www.jntuworld.com

Recipe for a Successful

www.jntuworld.com www.jwjobs.net

Page 142: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 142/169

pWarehouse

www.jntuworld.com

For a Successful Warehouse

www.jntuworld.com www.jwjobs.net

Page 143: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 143/169

143

From day one establish that warehousingis a joint user/builder project

Establish that maintaining data quality willbe an ONGOING joint user/builderresponsibility

Train the users one step at a time

Consider doing a high level corporate datamodel in no more than three weeks

From Larry Greenfield, http://pwp.starnetinc.com/larryg/index.ht

www.jntuworld.com

For a Successful Warehouse

www.jntuworld.com www.jwjobs.net

Page 144: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 144/169

144

Look closely at the data extracting,cleaning, and loading tools

Implement a user accessible automated

directory to information stored in thewarehouse

Determine a plan to test the integrity of the data in the warehouse

From the start get warehouse users in thehabit of 'testing' complex queries

www.jntuworld.com

For a Successful Warehouse

www.jntuworld.com www.jwjobs.net

Page 145: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 145/169

145

Coordinate system roll-out with networkadministration personnel

When in a bind, ask others who have done

the same thing for advice Be on the lookout for small, but strategic,

projects

Market and sell your data warehousingsystems

www.jntuworld.com

Data Warehouse Pitfalls

www.jntuworld.com www.jwjobs.net

Page 146: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 146/169

146

You are going to spend much time extracting,cleaning, and loading data

Despite best efforts at project management, datawarehousing project scope will increase

You are going to find problems with systemsfeeding the data warehouse

You will find the need to store data not beingcaptured by any existing system

You will need to validate data not being validatedby transaction processing systems

www.jntuworld.com

Data Warehouse Pitfalls

www.jntuworld.com www.jwjobs.net

Page 147: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 147/169

147

Some transaction processing systems feeding thewarehousing system will not contain detail

Many warehouse end users will be trained andnever or seldom apply their training

After end users receive query and report tools,requests for IS written reports may increase

Your warehouse users will develop conflictingbusiness rules

Large scale data warehousing can become anexercise in data homogenizing

www.jntuworld.com

Data Warehouse Pitfalls

www.jntuworld.com www.jwjobs.net

Page 148: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 148/169

148

'Overhead' can eat up great amounts of diskspace

The time it takes to load the warehouse willexpand to the amount of the time in the availablewindow... and then some

Assigning security cannot be done with atransaction processing system mindset

You are building a HIGH maintenance system

You will fail if you concentrate on resourceoptimization to the neglect of project, data, andcustomer management issues and anunderstanding of what adds value to thecustomer

www.jntuworld.com

DW and OLAP Research Issues

www.jntuworld.com www.jwjobs.net

Page 149: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 149/169

149

Data cleaning focus on data inconsistencies, not schema differences

data mining techniques

Physical Design

design of summary tables, partitions, indexes

tradeoffs in use of different indexes

Query processing

selecting appropriate summary tables

dynamic optimization with feedback

acid test for query optimization: cost estimation, use of transformations, search strategies

partitioning query processing between OLAP server andbackend server.

www.jntuworld.com

DW and OLAP Research Issues .. 2

www.jntuworld.com www.jwjobs.net

Page 150: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 150/169

150

Warehouse Management detecting runaway queries

resource management

incremental refresh techniques

computing summary tables during load

failure recovery during load and refresh

process management: scheduling queries,

load and refresh Query processing, caching

use of workflow technology for processmanagement

www.jntuworld.com

www.jntuworld.com www.jwjobs.net

Page 151: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 151/169

Products, References, Useful Links

www.jntuworld.com

Reporting Tools

www.jntuworld.com www.jwjobs.net

Page 152: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 152/169

152

Andyne Computing -- GQL

Brio -- BrioQuery Business Objects -- Business Objects Cognos -- Impromptu Information Builders Inc. -- Focus for Windows Oracle -- Discoverer2000 Platinum Technology -- SQL*Assist, ProReports PowerSoft -- InfoMaker SAS Institute -- SAS/Assist Software AG -- Esperant

Sterling Software -- VISION:Data 

www.jntuworld.com

OLAP and Executive InformationSystems

www.jntuworld.com www.jwjobs.net

Page 153: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 153/169

153

Andyne Computing -- Pablo Arbor Software -- Essbase

Cognos -- PowerPlay

Comshare -- Commander

OLAP Holistic Systems -- Holos

Information Advantage --AXSYS, WebOLAP

Informix -- Metacube Microstrategies --DSS/Agent

Microsoft -- Plato Oracle -- Express

Pilot -- LightShip

Planning Sciences --

Gentium Platinum Technology --

ProdeaBeacon, Forest & Trees

SAS Institute -- SAS/EIS,OLAP++

Speedware -- Media

www.jntuworld.com

Other Warehouse RelatedProducts

www.jntuworld.com www.jwjobs.net

Page 154: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 154/169

154

Data extract, clean, transform,refresh

CA-Ingres replicator

Carleton Passport Prism Warehouse Manager

SAS Access

Sybase Replication Server Platinum Inforefiner, Infopump

www.jntuworld.com

Extraction and TransformationTools

www.jntuworld.com www.jwjobs.net

Page 155: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 155/169

155

Carleton Corporation -- Passport

Evolutionary Technologies Inc. -- Extract

Informatica -- OpenBridge

Information Builders Inc. -- EDA Copy Manager

Platinum Technology -- InfoRefiner

Prism Solutions -- Prism Warehouse Manager

Red Brick Systems -- DecisionScape Formation

www.jntuworld.com

Scrubbing Tools

www.jntuworld.com www.jwjobs.net

Page 156: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 156/169

156

Apertus -- Enterprise/Integrator Vality -- IPE

Postal Soft

www.jntuworld.com

Warehouse Products

www.jntuworld.com www.jwjobs.net

Page 157: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 157/169

157

Computer Associates -- CA-Ingres Hewlett-Packard -- Allbase/SQL

Informix -- Informix, Informix XPS

Microsoft -- SQL Server Oracle -- Oracle7, Oracle Parallel Server

Red Brick -- Red Brick Warehouse

SAS Institute -- SAS Software AG -- ADABAS

Sybase -- SQL Server, IQ, MPP 

www.jntuworld.com

Warehouse Server Products

www.jntuworld.com www.jwjobs.net

Page 158: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 158/169

158

Oracle 8 Informix

Online Dynamic Server XPS --Extended Parallel Server Universal Server for object relational

applications

Sybase

Adaptive Server 11.5 Sybase MPP Sybase IQ

www.jntuworld.com

Warehouse Server Products

www.jntuworld.com www.jwjobs.net

Page 159: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 159/169

159

Red Brick Warehouse Tandem Nonstop

IBM

DB2 MVS

Universal Server

DB2 400

Teradata

www.jntuworld.com

Other Warehouse RelatedProducts

www.jntuworld.com www.jwjobs.net

Page 160: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 160/169

160

Connectivity to Sources Apertus

Information Builders EDA/SQL

Platimum Infohub SAS Connect

IBM Data Joiner

Oracle Open Connect Informix Express Gateway

www.jntuworld.com

Other Warehouse RelatedProducts

www.jntuworld.com www.jwjobs.net

Page 161: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 161/169

161

Query/Reporting Environments Brio/Query

Cognos Impromptu

Informix Viewpoint CA Visual Express

Business Objects

Platinum Forest and Trees

www.jntuworld.com

4GL's, GUI Builders, and PCDatabases

www.jntuworld.com www.jwjobs.net

Page 162: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 162/169

162

Information Builders -- Focus Lotus -- Approach

Microsoft -- Access, Visual Basic

MITI -- SQR/Workbench

PowerSoft -- PowerBuilder

SAS Institute -- SAS/AF

www.jntuworld.com

Data Mining Products

www.jntuworld.com www.jwjobs.net

Page 163: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 163/169

163

DataMind -- neurOagent Information Discovery -- IDIS

SAS Institute -- SAS/Neuronets

www.jntuworld.com

Data Warehouse

www.jntuworld.com www.jwjobs.net

Page 164: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 164/169

164

W.H. Inmon, Building the DataWarehouse, Second Edition, John Wileyand Sons, 1996

W.H. Inmon, J. D. Welch, Katherine L.Glassey, Managing the Data Warehouse,John Wiley and Sons, 1997

Barry Devlin, Data Warehouse from

Architecture to Implementation, AddisonWesley Longman, Inc 1997

www.jntuworld.com

Data Warehouse

www.jntuworld.com www.jwjobs.net

Page 165: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 165/169

165

W.H. Inmon, John A. Zachman, JonathanG. Geiger, Data Stores Data Warehousingand the Zachman Framework, McGraw HillSeries on Data Warehousing and Data

Management, 1997

Ralph Kimball, The Data WarehouseToolkit, John Wiley and Sons, 1996

www.jntuworld.com

OLAP and DSS

www.jntuworld.com www.jwjobs.net

Page 166: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 166/169

166

Erik Thomsen, OLAP Solutions, John Wileyand Sons 1997

Microsoft TechEd Transparencies fromMicrosoft TechEd 98

Essbase Product Literature

Oracle Express Product Literature

Microsoft Plato Web Site

Microstrategy Web Site

www.jntuworld.com

Data Mining

www.jntuworld.com www.jwjobs.net

Page 167: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 167/169

167

Michael J.A. Berry and Gordon Linoff, DataMining Techniques, John Wiley and Sons1997

Peter Adriaans and Dolf Zantinge, DataMining, Addison Wesley Longman Ltd.1996

KDD Conferences

www.jntuworld.com

Other Tutorials

www.jntuworld.com www.jwjobs.net

Page 168: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 168/169

168

Donovan Schneider, Data Warehousing Tutorial,Tutorial at International Conference for

Management of Data (SIGMOD 1996) and

International Conference on Very Large Data

Bases 97 Umeshwar Dayal and Surajit Chaudhuri, Data

Warehousing Tutorial at International Conference

on Very Large Data Bases 1996

Anand Deshpande and S. Seshadri, Tutorial on

Datawarehousing and Data Mining, CSI-97

www.jntuworld.com

Useful URLs

www.jntuworld.com www.jwjobs.net

Page 169: DWDM Single Ppt Notes

7/14/2019 DWDM Single Ppt Notes

http://slidepdf.com/reader/full/dwdm-single-ppt-notes-563279d76eae8 169/169

Ralph Kimball’s home page http://www.rkimball.com

Larry Greenfield’s Data WarehouseInformation Center http://pwp.starnetinc.com/larryg/

Data Warehousing Institute http://www.dw-institute.com/

OLAP Council