H.Lu/HKUST L08: Introduction to OLAP

Upload: gavin-wade

Post on 13-Dec-2015


H.Lu/HKUST MTMI519: Data Warehousing & OLAP -- 2

Evolution of Database Technology

1960s: Hierarchical (IMS) and network (CODASYL) DBMS.
1970s: Relational data model, relational DBMS implementation.
1980s: RDBMS rules the earth.
1985-: Advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.).
1990s: ORDB, OLAP, data mining, data warehousing, multimedia databases, and network databases.

New Business Environment

Economic crisis
    Importance of market and credit risk management for banks
Deregulation
    Intensifying competition heightened interest in retaining and acquiring good customers
Mergers & Acquisitions
    Need for a consolidated view of the business
    Created diverse computer systems within large corporations
E-Business
    New way of reaching customers
    Opportunity for 1:1 marketing

What This Means

Increasing competition
    customers have more choices
    price wars, such as the one in HK
    each operator finds its own niche (value, coverage, customer service)
Increasing “churn”
    focus on loyalty, customer relationship management
With similar technology, customers become more important to business
    turn from product-oriented to customer-oriented

An Example

Customer service hotline of a mobile phone company

Cust A: You made a mistake in my last month's statement …
Receptionist: Let me check … Oh, you are right … As a token of apology, we offer you one month of free service.

Cust B: You made a mistake in my last month's statement …
Receptionist: Let me check … Oh, you are right … As a token of apology, we will send you two free movie tickets.

On-line decision making

Changes in Business Strategy

Old-style incumbent one
    focuses on reducing cost
    improving product penetration (find a customer for a product, not vice versa)
New-style aggressive one
    focuses on getting closer to customers
    finding new ways to increase revenue from customers
    satisfaction -> loyalty -> more customers -> revenue

How Does IT Play Its Role?

From traditional to OLTP to OLAP
    OLTP: on-line transaction processing
    OLAP: on-line analytical processing
To better support OLAP, warehouse your business data
    querying one clean, integrated data warehouse rather than dozens of operational databases
To do more and better than OLAP, consider data mining
    discovering knowledge from operational data
    turning the huge volume of data into a mine of gold/diamond

How Can IT Play the Role?

Current situation
    In most organizations, data about specific parts of the business is there -- lots and lots of data, somewhere, in some form.
    Data is available but not information -- and not the right information at the right time.
What should we do?
    Bring together information from multiple sources so as to provide a consistent database source for decision support queries.
    Off-load decision support applications from the on-line transaction system.

Decision Support

Decision Support is a term used to describe the capability of a system to support the formulation of business decisions through complex queries against a database.

It can also specifically refer to a database which is intended for this purpose, as opposed to one which primarily supports on-line transaction processing operations.

Evolution of Electronic Data Processing

60’s: Batch reports
    hard to find and analyze information
    inflexible and expensive, reprogram for every new request
70’s: Terminal-based DSS and EIS
    still inflexible, not integrated with desktop tools
80’s: Desktop data access and analysis tools
    query tools, spreadsheets, GUIs
    easier to use, but only access operational DB
90’s: Data warehouse with integrated OLAP engines and tools

OLTP vs. Decision Support Queries

Traditionally, DBMS have been used for on-line transaction processing (OLTP)
    order entry: pull up order 990101 and update status field
    banking: transfer $1000 from account X to account Y
DSS: Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions
    What were the sales volumes by region and product category for the last year?
    How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
    Will a 10% discount increase sales volume sufficiently?
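The contrast between the two kinds of work can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `orders` table, its columns, and the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical toy table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, "
    "category TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (990101, "East", "phones", 120.0, "open"),
        (990102, "East", "plans", 80.0, "open"),
        (990103, "West", "phones", 200.0, "open"),
        (990104, "West", "phones", 50.0, "open"),
    ],
)

# OLTP-style work: find one record by key and update one field.
conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = 990101")
oltp_status = conn.execute(
    "SELECT status FROM orders WHERE order_id = 990101"
).fetchone()[0]

# Decision-support-style work: read-only scan and aggregate over all rows.
dss_rows = conn.execute(
    "SELECT region, category, SUM(amount) FROM orders "
    "GROUP BY region, category ORDER BY region, category"
).fetchall()

print(oltp_status)
print(dss_rows)
```

The first statement touches a single row through the primary key; the second has to read every row before it can answer.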

TPC-D Benchmark Query #16

Counts the number of Suppliers who can supply Parts that satisfy a particular customer's requirements. The Customer is interested in Parts of eight different sizes as long as they are not a given type, not of a given brand, and not from a Supplier who has had complaints registered at the Better Business Bureau. Results must be presented in descending count and ascending brand, type and size.

TPC-H Benchmark Query #16 (SQL)

SELECT P_BRAND, P_TYPE, P_SIZE,
       COUNT(DISTINCT PS_SUPPKEY) (NAMED SUPPLIER_CNT)
FROM PARTSUPP, PARTTBL
WHERE P_PARTKEY = PS_PARTKEY
  AND P_BRAND <> 'Brand#45'
  AND P_TYPE NOT LIKE 'MEDIUM POLISHED%'
  AND P_SIZE IN (49, 14, 23, 45, 19, 3, 36, 9)
  AND PS_SUPPKEY NOT IN (
        SELECT S_SUPPKEY
        FROM SUPPLIER
        WHERE S_COMMENT LIKE '%Better Business Bureau%Complaints%')
GROUP BY P_BRAND, P_TYPE, P_SIZE
ORDER BY SUPPLIER_CNT DESC, P_BRAND, P_TYPE, P_SIZE;

OLTP Applications

clerical data processing tasks
update-intensive
detailed, up-to-date data
structured, repetitive tasks
short transactions are the unit of work
read and/or update a few records
isolation, recovery and integrity are critical

Decision Support & OLAP

Decision support applications
    typically consist of long and often complex read-only queries that access large portions of the database.
Databases for decision support
    are updated relatively infrequently, either by periodic batch runs or by background "trickle" update streams.
    need not contain real-time or up-to-the-minute information, as decision support applications tend to process large amounts of data which usually would not be affected significantly by individual transactions.

OLTP vs. OLAP

                    OLTP                                      OLAP
users               clerk, IT professional                    knowledge worker
function            day-to-day operations                     decision support
DB design           application-oriented                      subject-oriented
data                current, up-to-date, detailed,            historical, summarized, multidimensional,
                    flat relational, isolated                 integrated, consolidated
usage               repetitive                                ad-hoc
access              read/write, index/hash on prim. key       lots of scans
unit of work        short, simple transaction                 complex query
# records accessed  tens                                      millions
# users             thousands                                 hundreds
DB size             100MB-GB                                  100GB-TB
metric              transaction throughput                    query throughput, response

Why Is Data Warehousing Needed?

Lack of historical business data
Data required for analysis often resides in different operational systems
Query performance is extremely poor when the analysis is done in the operational systems
Operational DBMS were not designed for decision support

The Architecture of Data

[Figure: layers of data, from most abstract to most detailed]
Business rules: what has been learned from data
Metadata: logical model
Database schema: physical layout of data
Summary data: summaries by who, what, when, where, ...
Operational data: who, what, when, where, ...

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process. --- W. H. Inmon

A decision support database that is used primarily in organizational decision making.

A collection of data maintained separately from the organization’s operational database

DW: Subject Oriented & Integrated

Subject oriented
    oriented to the major subject areas of the corporation that have been defined in the data model.
        • E.g. for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.
    operational DB and applications may be organized differently
        • E.g. based on type of insurance: auto, life, medical, fire, ...
Integrated
    There is no consistency in encoding, naming conventions, … among different data sources
    When data is moved to the warehouse, it is converted to a consistent form.

DW: Non-Volatile & Time Variant

Non-volatile
    Operational data is regularly accessed and manipulated a record at a time, and updates are made to data in the operational environment.
    Warehouse data is loaded and accessed. Update of data does not occur in the data warehouse environment.
Time variant
    The time horizon for the data warehouse is significantly longer than that of operational systems.
    Operational databases contain current-value data. Data warehouse data is nothing more than a sophisticated series of snapshots, taken as of some moment in time.
    The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time.

Why a Separate Data Warehouse?

Performance
    special data organization, access methods, and implementation methods are needed to support multidimensional views and operations typical of OLAP
    Complex OLAP queries would degrade performance for operational transactions
Function
    missing data: decision support requires historical data which operational DBs do not typically maintain
    data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources
    data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.

The Reference Architecture

[Figure: data warehouse reference architecture]
Data sources: operational DBs, other sources
Monitor & integrator: extract, transform, load, refresh
Data warehouse, data marts, metadata repository
OLAP servers: serve the tools
Tools: analysis, query, reports, data mining

Data Sources

Data sources are often the operational systems, providing the lowest level of data.

Data sources are designed for operational use, not for decision support, and the data reflect this fact.

Multiple data sources are often from different systems run on a wide range of hardware and much of the software is built in-house or highly customized.

Multiple data sources introduce a large number of issues -- semantic conflicts.

Data Extraction, Cleaning and Integration

Important to warehouse clean data (operational data from multiple sources are often dirty).
Three classes of tools
    Data migration: allows simple data transformation
    Data scrubbing: uses domain-specific knowledge to scrub data
    Data auditing: discovers rules and relationships by scanning data (detect outliers).
Data cleaning and integration may use up to 50-70% of the effort and budget

Load and Refresh

Loading the warehouse includes some other processing tasks: checking integrity constraints, sorting, summarizing, building indexes, etc.
Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse
    when to refresh
        • determined by usage, types of data source, etc.
    how to refresh
        • data shipping: using triggers to update snapshot log table and propagate the updated data to the warehouse
        • transaction shipping: shipping the updates in the transaction log
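Data shipping can be sketched with SQLite triggers from Python's sqlite3 module. This is a minimal sketch under stated assumptions, not how any particular warehouse product does it: the table names (`src_account`, `wh_account`, `snapshot_log`) and the `refresh` helper are hypothetical, and deletes are not handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Operational source table and its warehouse copy (hypothetical names).
    CREATE TABLE src_account (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE wh_account  (id INTEGER PRIMARY KEY, balance REAL);

    -- Snapshot log filled by triggers on the source table.
    CREATE TABLE snapshot_log (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                               id INTEGER, balance REAL);
    CREATE TRIGGER log_insert AFTER INSERT ON src_account
    BEGIN
        INSERT INTO snapshot_log (id, balance) VALUES (NEW.id, NEW.balance);
    END;
    CREATE TRIGGER log_update AFTER UPDATE ON src_account
    BEGIN
        INSERT INTO snapshot_log (id, balance) VALUES (NEW.id, NEW.balance);
    END;
    """
)

def refresh(conn):
    """Apply logged changes to the warehouse in order, then clear the log."""
    rows = conn.execute(
        "SELECT id, balance FROM snapshot_log ORDER BY seq"
    ).fetchall()
    for acc_id, balance in rows:
        conn.execute(
            "INSERT OR REPLACE INTO wh_account (id, balance) VALUES (?, ?)",
            (acc_id, balance),
        )
    conn.execute("DELETE FROM snapshot_log")
    return len(rows)

# Operational activity: the triggers record every change in the log.
conn.execute("INSERT INTO src_account VALUES (1, 100.0)")
conn.execute("INSERT INTO src_account VALUES (2, 50.0)")
conn.execute("UPDATE src_account SET balance = 75.0 WHERE id = 2")

applied = refresh(conn)
warehouse = conn.execute(
    "SELECT id, balance FROM wh_account ORDER BY id"
).fetchall()
print(applied, warehouse)
```

When to call `refresh` is the "when to refresh" decision from the slide; here it is simply called once after the updates.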

Integrator

receive changes from the monitors
    make the data conform to the conceptual schema used by the warehouse
integrate the changes into the warehouse
    merge the data with existing data already present
    resolve possible update anomalies

Metadata

Structure of the data in DW (data models)
Metrics (algorithms for summarization and aggregation)
Mapping from legacy systems to the data warehouse
Data usage statistics
Performance statistics

Metadata Repository (I)

Administrative metadata
    source databases and their contents
    gateway descriptions
    warehouse schema, view and derived data definitions
    dimensions and hierarchies
    pre-defined queries and reports
    data mart locations and contents
    data partitions
    data extraction, cleansing, transformation rules, defaults
    data refresh and purge rules
    user profiles, user groups
    security: user authorization, access control

Metadata Repository (II)

Business data
    business terms and definitions
    ownership of data
    charging policies
Operational metadata
    data lineage: history of migrated data and sequence of transformations applied
    currency of data: active, archived, purged
    monitoring information: warehouse usage statistics, error reports, audit trails

Data Marts

A data mart (departmental data warehouse) is a specialized system that brings together the data needed for a department or related applications.
Data marts can be implemented within the data warehouse by creating special, application-specific views.
Data marts can also be implemented as materialized views: departmental subsets that focus on selected subjects.
More sophisticated data marts may use different data representations and include their own OLAP engines.

Other Tools

User interface that allows users to interact with the warehouse
    query and reporting tools
    analysis tools
    data mining tools

System Design

Capacity planning -- define architecture
Integrate servers, storage, clients
Design warehouse schema, views
Design physical warehouse organization: data placement, partitioning, access methods
Connect sources: gateways, ODBC drivers
Design and implement scripts for data extract, load and refresh
Define metadata and populate repository
Design and implement end-user applications
Roll out warehouse and applications

Technologies Involved

conceptual data modeling
    design warehouse schema
integration of data from heterogeneous sources
    for monitor and integrator
extending relational database techniques
    multidimensional database and MOLAP
distributed and parallel processing
    warehouse and OLAP server

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measurements
    Star schema: A single object (fact table) in the middle connected to a number of objects (dimension tables).
    Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
    Fact constellations: Multiple fact tables share dimension tables.

Example of Star Schema

[Figure: star schema with the Sales fact table in the middle and four dimension tables around it]
Sales Fact Table: Date, Product, Store, Customer (dimension keys); unit_sales, dollar_sales, Yen_sales (measurements)
Date: Date, Month, Year
Cust: CustId, CustName, CustCity, CustCountry
Product: ProductNo, ProdName, ProdDesc, Category, QOH
Store: StoreID, City, State, Country, Region
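A star schema like this one can be sketched in SQLite through Python's sqlite3 module. The sketch keeps only two of the four dimensions and a few columns; the table names, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Two dimension tables (a reduced, hypothetical version of the figure).
    CREATE TABLE date_dim  (date_id INTEGER PRIMARY KEY, month INTEGER, year INTEGER);
    CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT);

    -- Fact table: one foreign key per dimension plus the measurements.
    CREATE TABLE sales_fact (
        date_id      INTEGER REFERENCES date_dim(date_id),
        store_id     INTEGER REFERENCES store_dim(store_id),
        unit_sales   INTEGER,
        dollar_sales REAL
    );

    INSERT INTO date_dim  VALUES (1, 1, 1998), (2, 2, 1998);
    INSERT INTO store_dim VALUES (10, 'Hong Kong', 'China'), (20, 'Tokyo', 'Japan');
    INSERT INTO sales_fact VALUES
        (1, 10, 5, 500.0), (1, 20, 2, 300.0), (2, 10, 1, 100.0);
    """
)

# Star join: the fact table joins directly to each dimension it needs,
# then the measurements are aggregated by dimension attributes.
rows = conn.execute(
    """
    SELECT s.country, d.month, SUM(f.dollar_sales)
    FROM sales_fact f
    JOIN date_dim d  ON f.date_id = d.date_id
    JOIN store_dim s ON f.store_id = s.store_id
    GROUP BY s.country, d.month
    ORDER BY s.country, d.month
    """
).fetchall()
print(rows)
```

In a snowflake version of the same design, `country` would move to its own table and the query would need one more join.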

Example of Snowflake Schema

[Figure: snowflake schema; the Date and Store dimensions of the star schema are normalized into chains of tables]
Sales Fact Table: Date, Product, Store, Customer (dimension keys); unit_sales, dollar_sales, Yen_sales (measurements)
Date: Date, Month
Month: Month, Year
Year: Year
Cust: CustId, CustName, CustCity, CustCountry
Product: ProductNo, ProdName, ProdDesc, Category, QOH
Store: StoreID, City
City: City, State
State: State, Country
Country: Country, Region

Star Schema versus Snowflake Schema

Star Schema                      Snowflake Schema
De-normalized                    Normalized
Few attribute tables             More attribute tables
Simple attribute relationship    Complex attribute relationship
Bigger attribute tables          Smaller attribute tables
Fewer joins                      More joins

Real data warehouses are rarely designed in pure Star or Snowflake schema because of the complex relationships among the modeled objects.

Summary Tables

A data warehouse may store some selected summary data, the pre-aggregated data.
Summary data can be stored as separate fact tables sharing the same dimension tables with the base fact table.
Summary data can be encoded in the original fact table and dimension tables.

Date dimension table:
id  level  date  month  year
0   1      1     1      1998
1   2      NULL  1      1998
2   2      NULL  2      1998
3   3      NULL  NULL   1998

Fact table:
DateID  ProdID  Sales
0       1       1000
1       1       20000
1       2       40000
3       1       300000
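The encoded form can be queried by filtering on the level column. The sketch below loads the two small tables from this slide into SQLite via Python's sqlite3 module. Reading level 1 as day, 2 as month, and 3 as year is an interpretation of the NULL pattern, and the table names and the `sales_at_level` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE date_dim (id INTEGER PRIMARY KEY, level INTEGER,
                           date INTEGER, month INTEGER, year INTEGER);
    CREATE TABLE sales_fact (date_id INTEGER, prod_id INTEGER, sales INTEGER);

    -- Rows copied from the slide; NULL marks the attributes summed away.
    INSERT INTO date_dim VALUES
        (0, 1, 1, 1, 1998), (1, 2, NULL, 1, 1998),
        (2, 2, NULL, 2, 1998), (3, 3, NULL, NULL, 1998);
    INSERT INTO sales_fact VALUES
        (0, 1, 1000), (1, 1, 20000), (1, 2, 40000), (3, 1, 300000);
    """
)

def sales_at_level(conn, level):
    """Return (year, month, date, prod_id, sales) rows stored at one level."""
    return conn.execute(
        """
        SELECT d.year, d.month, d.date, f.prod_id, f.sales
        FROM sales_fact f JOIN date_dim d ON f.date_id = d.id
        WHERE d.level = ?
        ORDER BY f.prod_id
        """,
        (level,),
    ).fetchall()

monthly = sales_at_level(conn, 2)  # pre-aggregated monthly totals
yearly = sales_at_level(conn, 3)   # pre-aggregated yearly totals
print(monthly)
print(yearly)
```

Detail rows and summaries share one fact table, so a query must always restrict the level; forgetting the filter would mix daily figures with monthly and yearly totals.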

OLAP Servers

Relational OLAP: extended relational DBMS that maps operations on multidimensional data to standard relational operations

Multidimensional OLAP: special-purpose server that directly implements multidimensional data and operations

ROLAP versus MOLAP

ROLAP
    exploits services of relational engine effectively
    provides additional OLAP services
        • design tools for DSS schema
        • performance analysis tool to pick aggregates to materialize
    SQL gets in the way of sequential processing and columnar aggregation
    Some queries are hard to formulate and can often be time-consuming to execute

ROLAP versus MOLAP

MOLAP
    the storage model is an n-dimensional array
    Front-end multidimensional queries map to server capabilities in a straightforward way
    Direct addressing abilities
    Handling sparse data in array representation is expensive
    Poor storage utilization when the data is sparse
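The trade-off between direct addressing and storage utilization can be made concrete with a small Python sketch. The cube shape, the three populated cells, and the function names are hypothetical; the offset formula is ordinary row-major addressing, used here as one plausible layout rather than the layout of any specific MOLAP server.

```python
def dense_cells(shape):
    """Number of cells a dense n-dimensional array allocates."""
    total = 1
    for extent in shape:
        total *= extent
    return total

def dense_offset(coords, shape):
    """Row-major offset of a cell: the direct addressing a dense array offers."""
    offset = 0
    for index, extent in zip(coords, shape):
        offset = offset * extent + index
    return offset

# Hypothetical cube: 100 products x 50 regions x 12 months.
shape = (100, 50, 12)

# Sparse alternative: only non-empty cells are stored, keyed by coordinates.
sparse_cube = {
    (0, 0, 0): 120.0,
    (3, 7, 1): 80.0,
    (99, 49, 11): 45.5,
}

allocated = dense_cells(shape)
utilization = len(sparse_cube) / allocated
last_offset = dense_offset((99, 49, 11), shape)
print(allocated, utilization, last_offset)
```

The dense array finds any cell with simple arithmetic but reserves space for every combination, while the dictionary stores only what exists and pays for a lookup instead.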

Multidimensional View of Data

Sales volume as a function of product, time, and geography

[Figure: cube with axes Product, Region, Month]

Dimensions: Product, Region, Week
Hierarchical summarization paths

Industry    Country    Year
Category    Region     Quarter
Product     City       Month    Week
            Office     Day

A Sample Data Cube

[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (China, India, Japan, sum). The highlighted cell is the total annual sales of TV in China.]

Sample Operations

Roll up: summarize data
    total sales volume last year by product category by region
Roll down, drill down, drill through: go from higher-level summary to lower-level summary or detailed data
    For a particular product category, find the detailed sales data for each salesperson by date
Slice and dice: select and project
    Sales of beverages in the West over the last 6 months
Pivot: reorient cube
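These operations can be sketched on a tiny in-memory fact list in Python. The rows, the tuple layout (category, region, month, sales), and the function names are hypothetical; real OLAP servers implement the same ideas over much larger stores.

```python
from collections import defaultdict

# Hypothetical fact rows: (category, region, month, sales).
facts = [
    ("beverages", "West", 1, 10),
    ("beverages", "West", 2, 20),
    ("beverages", "East", 1, 5),
    ("snacks", "West", 1, 7),
]

def roll_up(rows, key_positions):
    """Sum sales grouped by the chosen dimensions (a coarser summary)."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        totals[key] += row[3]
    return dict(totals)

def slice_rows(rows, position, value):
    """Keep only rows whose dimension at `position` equals `value`."""
    return [row for row in rows if row[position] == value]

def pivot(totals):
    """Reorient a two-dimensional summary by swapping its two key parts."""
    return {(b, a): v for (a, b), v in totals.items()}

# Roll up: drop the month dimension, keep category and region.
by_cat_region = roll_up(facts, (0, 1))
# Slice and dice: beverages in the West.
west_beverages = slice_rows(slice_rows(facts, 0, "beverages"), 1, "West")
# Pivot: view the same summary as region by category.
by_region_cat = pivot(by_cat_region)
print(by_cat_region, west_beverages, by_region_cat)
```

Drill down is the reverse of `roll_up`: going back from `by_cat_region` to the rows in `facts` that produced each total.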

Cube Operation

SELECT date, product, customer, SUM (amount)
FROM SALES
CUBE BY date, product, customer

Need to compute the following group-bys:
    (date, product, customer),
    (date, product), (date, customer), (product, customer),
    (date), (product), (customer),
    () -- the grand total
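A CUBE over n dimensions implies one group-by per subset of the dimensions, 2^n in total once the empty grouping (the grand total) is counted. A short Python sketch, with a hypothetical helper name, enumerates them:

```python
from itertools import combinations

def cube_group_bys(dimensions):
    """List every grouping set of a CUBE, from all dimensions down to none."""
    group_bys = []
    for size in range(len(dimensions), -1, -1):
        group_bys.extend(combinations(dimensions, size))
    return group_bys

group_bys = cube_group_bys(["date", "product", "customer"])
for g in group_bys:
    print(g)
```

For the three dimensions above this yields 8 grouping sets, which is why the cost of a full cube grows quickly as dimensions are added.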

Cuboid Lattice

[Figure: lattice of cuboids for four dimensions A, B, C, D]
(all)
(A) (B) (C) (D)
(A,B) (A,C) (A,D) (B,C) (B,D) (C,D)
(A,B,C) (A,B,D) (A,C,D) (B,C,D)
(A,B,C,D)

Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cube.
The top-most cuboid contains only one cell.

Cuboid -- A Formal Definition

Let R be a relation with k+1 attributes X = {A1, A2, …, Ak, V}.

A cuboid on j attributes S = {Ai1, Ai2, …, Aij} is defined as a group-by on attributes Ai1, Ai2, …, Aij using aggregate function F(.) applied on attribute V. This cuboid can be represented as a k+1 attribute relation by using the special value ALL for the remaining k-j attributes.

The CUBE on attribute set X is the union of cuboids on all subsets of the attributes A1, …, Ak. The cuboid on all of A1, …, Ak is called the base cuboid.
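The definition can be followed almost literally in Python. This is a minimal sketch assuming F is SUM and that the measure V is the last column of each row; the sample relation and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # special value standing for "every value of this attribute"

def cuboid(rows, k, kept):
    """Group-by on attribute positions `kept`, using SUM as F and ALL elsewhere."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] if i in kept else ALL for i in range(k))
        totals[key] += row[k]  # the last column is the measure V
    return dict(totals)

def cube(rows, k):
    """Union of the cuboids over every subset of the k grouping attributes."""
    result = {}
    for j in range(k + 1):
        for kept in combinations(range(k), j):
            result.update(cuboid(rows, k, set(kept)))
    return result

# Hypothetical relation R(A1, A2, V) with k = 2.
R = [("tv", "china", 10), ("tv", "japan", 20), ("pc", "china", 5)]
result = cube(R, 2)
for key, total in sorted(result.items()):
    print(key, total)
```

Each key is a k-tuple, so every cuboid fits the same k+1 attribute layout; the cell `(ALL, ALL)` is the single cell of the top-most cuboid.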

Cube Computation -- Array Based Algorithm

A MOLAP approach: the base cuboid is stored as a multidimensional array.
Read in a number of cells to compute partial cuboids

[Figure: three-dimensional array with axes A, B, C and the cuboids computed from it: {ABC}, {AB}, {AC}, {BC}, {A}, {B}, {C}, {}]

View and Materialized Views

View
    derived relation defined in terms of base (stored) relations
Materialized views
    a view can be materialized by storing the tuples of the view in the database
    Index structures can be built on the materialized view
Maintenance is an issue for materialized views
    recomputation
    incremental updating
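The two maintenance options can be compared on a small example in Python. The view here is assumed to be SUM(amount) grouped by product, and only inserts are handled; the data and function names are hypothetical.

```python
from collections import defaultdict

def recompute(base_rows):
    """Rebuild the materialized view SUM(amount) GROUP BY product from scratch."""
    view = defaultdict(int)
    for product, amount in base_rows:
        view[product] += amount
    return dict(view)

def apply_insert(view, base_rows, new_row):
    """Incrementally maintain the view for one inserted base row."""
    product, amount = new_row
    base_rows.append(new_row)
    view[product] = view.get(product, 0) + amount

base = [("tv", 10), ("pc", 5)]
view = recompute(base)

# Incremental updating touches one view row per inserted base row.
apply_insert(view, base, ("tv", 7))
apply_insert(view, base, ("vcr", 3))

# Recomputation reads the whole base relation but gives the same answer.
print(view == recompute(base), view)
```

SUM is easy to maintain under inserts; deletes and aggregates such as MIN or MAX may force a look back at the base data, which is part of why maintenance is listed as an issue.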

Issues Related to Materialized Views

Select a set of views to be materialized
    limited by resources, cannot materialize all the views
    issues to consider: available resources, overhead with respect to the workload
    simple algorithm works reasonably well.
Exploit the materialized views to answer queries
    Query optimization using views
Efficiently update materialized views during loading and refresh
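The slides do not say which simple algorithm is meant for view selection. One common candidate is a greedy heuristic: repeatedly materialize the cuboid that saves the most query cost. The Python sketch below assumes that heuristic, a linear cost model (a query costs as many rows as the smallest materialized cuboid that covers it), one query per cuboid, and hypothetical cuboid sizes; all names are illustrative.

```python
from itertools import combinations

def all_cuboids(dims):
    """Every subset of the dimensions, as frozensets."""
    return [frozenset(c) for n in range(len(dims) + 1)
            for c in combinations(dims, n)]

def query_cost(query, materialized, size):
    """Rows scanned: size of the smallest materialized cuboid covering the query."""
    return min(size[v] for v in materialized if query <= v)

def greedy_select(dims, size, budget):
    """Pick up to `budget` extra cuboids, each time taking the largest benefit."""
    cuboids = all_cuboids(dims)
    chosen = [frozenset(dims)]  # the base cuboid is assumed to be always stored
    for _ in range(budget):
        best, best_benefit = None, 0
        for v in cuboids:
            if v in chosen:
                continue
            # Benefit: total cost saved over all queries that v can answer.
            benefit = sum(
                max(0, query_cost(w, chosen, size) - size[v])
                for w in cuboids if w <= v
            )
            if benefit > best_benefit:
                best, best_benefit = v, benefit
        if best is None:
            break
        chosen.append(best)
    return chosen

# Hypothetical row counts per cuboid over dimensions A, B, C.
size = {
    frozenset("ABC"): 1000,
    frozenset("AB"): 900, frozenset("AC"): 200, frozenset("BC"): 500,
    frozenset("A"): 50, frozenset("B"): 150, frozenset("C"): 20,
    frozenset(): 1,
}
chosen = greedy_select("ABC", size, budget=2)
print([sorted(v) for v in chosen])
```

With these sizes the heuristic keeps the base cuboid and adds (A,C) and then (B,C): each is much smaller than its parent and can answer several coarser queries. The same `query_cost` lookup also illustrates answering a query from a materialized view instead of the base data.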