data warehousing/mining comp 150 additional information

68
Data Warehousing/Mining Data Warehousing/Mining Comp 150 Additional Information Instructor: Dan Hebert

Upload: fancy

Post on 22-Feb-2016

38 views

Category:

Documents


0 download

DESCRIPTION

Data Warehousing/Mining Comp 150 Additional Information. Instructor: Dan Hebert. The Need for Data Analysis. Constant pressure from external and internal forces requires prompt tactical and strategic decisions. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 1

Data Warehousing/Mining

Comp 150Additional Information

Instructor: Dan Hebert

Page 2: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 2

The Need for Data Analysis

Constant pressure from external and internal forces requires prompt tactical and strategic decisions.

The decision-making cycle time is reduced, while problems are increasingly complex with a growing number of internal and external variables.

Managers need support systems for facilitating quick decision making in a complex environment.

Decision support systems (DSS).

Page 3: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 3

Page 4: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 4

Decision Support Systems Decision Support is a methodology (or a series of

methodologies) designed to extract information from data and to use such information as a basis for decision making.

A decision support system (DSS) is an arrangement of computerized tools used to assist managerial decision making within a business. – A DSS usually requires extensive data “massaging” to

produce information.– The DSS is used at all levels within an organization and is

often tailored to focus on specific business areas or problems.

– The DSS is interactive and provides ad hoc query tools to retrieve data and to display data in different formats.

Page 5: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 5

Decision Support Systems

Four Components of a DSS– The data store component is basically a DSS

database. – The data extraction and filtering component is

used to extract and validate the data taken from the operational database and the external data sources.

– The end user query tool is used by the data analyst to create the queries that access the database.

– The end user presentation tool is used by the data analyst to organize and present the data.

Page 6: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 6

Main Components Of A Decision Support System (DSS)

Figure 13.1

Page 7: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 7

Decision Support Systems

Operational Data vs. Decision Support Data– Most operational data are stored in a

relational database in which the structures tend to be highly normalized.

– The operational data storage is optimized to support transactions that represent daily operations.

– Whereas operational data capture daily business transactions, DSS data give tactical and strategic business meaning to the operational data.

Page 8: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 8

Decision Support Systems Three Main Areas in Which DSS Data Differ from

Operational Data– Time span

Operational data represent current (atomic) transactions.

DSS data tend to cover a longer time frame.– Granularity

Operational data represent specific transactions that occur at a given time.

DSS data must be presented at different levels of aggregation.

– Dimensionality Operational data focus on representing atomic

transactions. DSS data can be analyzed from multiple dimensions.

Page 9: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 9

Page 10: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 10

Table 13.2 Contrasting Operational And DSS Data Characteristics

Page 11: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 11

Decision Support Systems

The DSS Database Requirements– Database Schema

The DSS database schema must support complex (non-normalized) data representations.

The queries must be able to extract multidimensional time slices.

Page 12: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 12

Ten Year Sales History For A Single Department,Millions Of Dollars

Table 13.3

Page 13: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 13

Yearly Sales Summaries, Two Stores and Two DepartmentsPer Store, Millions Of Dollars

Table 13.4

Page 14: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 14

Decision Support Systems

– Data Extraction and Loading The DBMS must support advanced data

extracting and filtering tools. The data extraction capabilities should

support different data sources and multiple vendors.

Data filtering capabilities must include the ability to check for inconsistent data or data validation rules.

The DBMS must support advanced data integration, aggregation, and classification capabilities.

Page 15: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 15

Yearly Sales Summaries, 20 Stores, 10 Departments Per Store,Millions Of Dollars

Table 13.5

Page 16: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 16

Decision Support Systems– End-User Analytical Interface

The DSS DBMS must support advanced data modeling and data presentation tools, data analysis tools, and query generation and optimization components.

The end user analytical interface is one of the most critical components.

– Database Size Requirements DSS databases tend to be very large. The DBMS must be capable of supporting very

large databases (VLDB). The DBMS may be required to use advanced

hardware, such as multiple disk arrays and multiple-processor technologies.

Page 17: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 17

The Data Warehouse

The Data Warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making.

– Integrated The Data Warehouse is a centralized, consolidated

database that integrates data retrieved from the entire organization.

– Subject-Oriented The Data Warehouse data is arranged and optimized

to provide answers to questions coming from diverse functional areas within a company.

Page 18: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 18

The Data Warehouse

– Time Variant The Warehouse data represent the flow

of data through time. It can even contain projected data.

– Non-Volatile Once data enter the Data Warehouse,

they are never removed. The Data Warehouse is always growing.

Page 19: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 19

Table 13.6A Comparison Of Data Warehouse And Operational Database Characteristics

Page 20: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 20

Creating A Data Warehouse

Figure 13.3

Page 21: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 21

The Data Warehouse

Data Mart– A data mart is a small, single-subject data

warehouse subset that provides decision support to a small group of people.

– Data Marts can serve as a test vehicle for companies exploring the potential benefits of Data Warehouses.

– Data Marts address local or departmental problems, while a Data Warehouse involves a company-wide effort to support decision making at all levels in the organization.

Page 22: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 22

DSS Architectural Styles

Table 13.7

Page 23: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 23

The Data Warehouse Twelve Rules That Define a Data Warehouse

1.The Data Warehouse and operational environments are separated.2.The Data Warehouse data are integrated.3.The Data Warehouse contains historical data over a long time horizon.4.The Data Warehouse data are snapshot data captured at a given point in time.5.The Data Warehouse data are subject-oriented.6.The Data Warehouse data are mainly read-only with periodic batch updates from operational data. No online updates are allowed. 7.The Data Warehouse development life cycle differs from classical systems development. The Data Warehouse development is data driven; the classical approach is process driven.

Page 24: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 24

8.The Data Warehouse contains data with several levels of detail; current detail data, old detail data, lightly summarized, and highly summarized data.9.The Data Warehouse environment is characterized by read-only transactions to very large data sets. The operational environment is characterized by numerous update transactions to a few data entities at the time.10.The Data Warehouse environment has a system that traces data resources, transformation, and storage.11.The Data Warehouse’s metadata are a critical component of this environment. The metadata identify and define all data elements. The metadata provide the source, transformation, integration, storage, usage, relationships, and history of each data element.12.The Data Warehouse contains a charge-back mechanism for resource usage that enforces optimal use of the data by end users.

The Data Warehouse

Page 25: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 25

On-Line Analytical Processing

On-Line Analytical Processing (OLAP) is an advanced data analysis environment that supports decision making, business modeling, and operations research activities.

Four Main Characteristics of OLAP– Use multidimensional data analysis techniques– Provide advanced database support– Provide easy-to-use end user interfaces– Support client/server architecture

Page 26: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 26

On-Line Analytical Processing

Multidimensional Data Analysis Techniques– The processing of data in which data are

viewed as part of a multidimensional structure.

– Multidimensional view allows end users to consolidate or aggregate data at different levels.

– Multidimensional view allows a business analyst to easily switch business perspectives.

– Figure 13.4

Page 27: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 27

Figure 13.4 Operational Vs. Multidimensional View Of Sales

Page 28: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 28

On-Line Analytical Processing

Additional Functions of Multidimensional Data Analysis Techniques– Advanced data presentation functions– Advanced data aggregation,

consolidation, and classification functions

– Advanced computational functions– Advanced data modeling functions

Page 29: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 29

Figure 13.5 Integration Of OLAP With A Spreadsheet Program

Page 30: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 30

On-Line Analytical Processing

Advanced Database Support– Access to many different kinds of DBMSs, flat files, and

internal and external data sources.– Access to aggregated Data Warehouse data as well as

to the detail data found in operational databases.– Advanced data navigation features such as drill-down

and roll-up.– Rapid and consistent query response times.– The ability to map end user requests, expressed in

either business or model terms, to the appropriate data source and then to the proper data access language.

– Support for very large databases.

Page 31: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 31

On-Line Analytical Processing

Easy-to-Use End User Interface– Easy-to-use graphical user interfaces

make sophisticated data extraction and analysis tools easily accepted and readily used.

Client/Server Architecture– The client/server environment enables

us to divide an OLAP system into several components that define its architecture.

Page 32: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 32

On-Line Analytical Processing

OLAP Architecture– Three Main Modules

OLAP Graphical User Interface (GUI) OLAP Analytical Processing Logic OLAP Data Processing Logic

– OLAP systems are designed to use both operational and Data Warehouse data.

Page 33: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 33

Figure 13.7 OLAP Server Arrangement

Page 34: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 34

Figure 13.8 OLAP Server With Multidimensional Data Store Arrangement

Page 35: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 35

Figure 13.9 OLAP Server With Local Mini Data-Marts

Page 36: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 36

On-Line Analytical Processing

Relational OLAP– Relational On-Line Analytical Processing

(ROLAP) provides OLAP functionality by using relational database and familiar relational query tools.

– Extensions to RDBMS Multidimensional data schema support within the

RDBMS Data access language and query performance

optimized for multidimensional data Support for very large databases

Page 37: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 37

On-Line Analytical Processing

– Multidimensional Data Schema Support within the RDBMS

Normalization of tables in relational technology is seen as a stumbling block to its use in OLAP systems.

DSS data tend to be non-normalized, duplicated, and pre- aggregated.

ROLAP uses a special design technique to enable RDBMS technology to support multidimensional data representations, known as star schema.

Star schema creates the near equivalent of a multidimensional database schema from the existing relational database.

Page 38: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 38

On-Line Analytical Processing– Data Access Language and Query Performance

Optimized for Multidimensional Data Most decision support data requests require the use of

multiple-pass SQL queries or multiple nested SQL statements.

ROLAP extends SQL so that it can differentiate between access requirements for data warehouse data and operational data.

– Support for Very Large Databases Decision support data are normally loaded in bulk (batch)

mode from the operational data. RDBMS must have the proper tools to import, integrate,

and populate the data warehouse with operational data. The speed of the data-loading operations is important.

Page 39: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 39

Figure 13.10 A Typical ROLAP Client/Server Architecture

Page 40: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 40

On-Line Analytical Processing Multidimensional OLAP (MOLAP)

– MOLAP extends OLAP functionality to multidimensional databases (MDBMS).

– MDBMS end users visualize the stored data as a multidimensional cube known as a data cube.

– Data cubes are created by extracting data from the operational databases or from the data warehouse.

– Data cubes are static and require front-end design work.

– To speed data access, data cubes are normally held in memory, called cube cache.

– MOLAP is generally faster than their ROLAP counterparts. It is also more resource-intensive.

– MDBMS is best suited for small and medium data sets.

– Multidimensional data analysis is also affected by how the database system handles sparsity.

Page 41: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 41

MOLAP Client/Server Architecture

Figure 13.11

Page 42: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 42

Relational Vs. Multidimensional OLAP

Table 13.8

Page 43: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 43

Star Schema

The star schema is a data-modeling technique used to map multidimensional decision support into a relational database.

Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.

Four Components:– Facts– Dimensions– Attributes– Attribute hierarchies

Page 44: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 44

A Simple Star Schema

Figure 13.12

Page 45: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 45

Star Schema

Facts– Facts are numeric measurements (values) that

represent a specific business aspect or activity.– The fact table contains facts that are linked through

their dimensions.– Facts can be computed or derived at run-time

(metrics).

Dimensions– Dimensions are qualifying characteristics that provide

additional perspectives to a given fact.– Dimensions are stored in dimension tables.

Page 46: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 46

Star Schema Attributes

– Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts.

– Dimensions provide descriptive characteristics about the facts through their attributes.

Table 13.9 Possible Attributes For Sales Dimensions

Page 47: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 47

Three Dimensional View Of Sales

Figure 13.13

Page 48: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 48

Slice And Dice View Of Sales

Figure 13.14

Page 49: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 49

Star Schema

Attribute Hierarchies– Attributes within dimensions can be

ordered in a well-defined attribute hierarchy.

– The attribute hierarchy provides a top-down data organization that is used for two main purposes:

Aggregation Drill-down/roll-up data analysis

Page 50: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 50

A Location Attribute Hierarchy

Figure 13.15

Page 51: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 51

Attribute Hierarchies In Multidimensional Analysis

Figure 13.16

Page 52: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 52

Star Schema

Star Schema Representation– Facts and dimensions are normally

represented by physical tables in the data warehouse database.

– The fact table is related to each dimension table in a many-to-one (M:1) relationship.

– Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.

Page 53: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 53Figure 13.17 Star Schema For Sales

Page 54: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 54Figure 13.18 Orders Star Schema

Page 55: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 55

Star Schema

Performance-Improving Techniques– Normalization of dimensional tables– Multiple fact tables representing

different aggregation levels– Denormalization of fact tables– Table partitioning and replication

Page 56: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 56

Figure 13.19 Normalized Dimension Tables

Page 57: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 57

Figure 13.20Multiple Fact Tables

Page 58: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 58

Data Warehouse Implementation The Data Warehouse as an Active Decision

Support Network– A Data Warehouse is a dynamic support

framework.– Implementation of a Data Warehouse is part

of a complete database-system-development infrastructure for company-wide decision support.

– Its design and implementation must be examined in the light of the entire infrastructure.

Page 59: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 59

Data Warehouse Implementation A Company-Wide Effort that Requires User

Involvement and Commitment at All Levels– For a successful design and

implementation, the designer must: Involve end users in the process. Secure end users’ commitment from the

beginning. Create continuous end user feedback. Manage end user expectations. Establish procedures for conflict resolution.

Page 60: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 60

Data Warehouse Implementation Satisfy the Trilogy: Data, Analysis, and

Users– For a successful design and

implementation, the designer must satisfy:

Data integration and loading criteria. Data analysis capabilities with acceptable

query performance. End user data analysis needs.

Page 61: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 61

Data Warehouse Implementation Apply Database Design Procedures

– Data Warehouse development is a company-wide effort and requires many resources: people, financial, and technical.

The sheer and often mind-boggling quantity of decision support data is likely to require the latest hardware and software.

It is also imperative to have very detailed procedures to orchestrate the flow of data from the operational databases to the Dare Warehouse.

To implement and support the Data Warehouse architecture, we also need people with advanced database design, software integration, and management skills.

Page 62: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 62

Data Warehouse Implementation Road Map

Figure 13.21

Page 63: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 63

Data Mining

In contrast to the traditional (reactive) DSS tools, the data mining premise is proactive.

Data mining tools automatically search the data for anomalies and possible relationships, thereby identifying problems that have not yet been identified by the end user.

Data mining tools -- based on algorithms that form the building blocks for artificial intelligence, neural networks, inductive rules, and predicate logic -- initiate analysis to create knowledge.

Page 64: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 64

Extraction Of Knowledge From Data

Figure 13.22

Page 65: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 65

Data Mining

Four Phases of Data Mining1. Data Preparation

Identify and cleanse data sets. Data Warehouse is usually used for data mining

operations.

2. Data Analysis and Classification Identify common data characteristics or patterns

using– Data groupings, classifications, clusters, or sequences.– Data dependencies, links, or relationships.– Data patterns, trends, and deviations.

Page 66: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 66

Data Mining

3. Knowledge Acquisition Select the appropriate modeling or knowledge

acquisition algorithms. Examples: neural networks, decision trees, rules

induction, genetic algorithms, classification and regression tree, memory-based reasoning, or nearest neighbor and data visualization.

4. Prognosis Predict future behavior and forecast business

outcomes using the data mining findings.

Page 67: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 67

Data-Mining Phases

Figure 13.23

Page 68: Data Warehousing/Mining Comp 150 Additional Information

Data Warehousing/Mining 68

A Sample Of Current Data Warehousing And Data Mining Vendors

Table 13.10