data warehouse design enrico franconi cs 636. cs 3362 implementing a warehouse monitoring: sending...

46
Data Warehouse Design Enrico Franconi CS 636

Upload: amberly-hicks

Post on 17-Dec-2015

222 views

Category:

Documents


3 download

TRANSCRIPT

Data Warehouse Design

Enrico Franconi

CS 636

CS 336 2

Implementing a Warehouse

Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing, ... Managing: Metadata, Design, ...

CS 336 3

Monitoring Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … How to get data out?

Replication tool Dump file Create report ODBC or third-party “wrappers”

CS 336 4

Monitoring Techniques

Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring

CS 336 5

Monitoring Issues

Frequency periodic: daily, weekly, … triggered: on “big” change, lots of changes, ...

Data transformation convert data to uniform format remove & add fields (e.g., add date to get history)

Standards (e.g., ODBC)

Gateways

CS 336 6

Wrapper

Converts data and queries from one data model to another

Extends query capabilities for sources with limited capabilities

DataModel

B

DataModel

A

Queries

Data

Queries SourceWrapper

CS 336 7

Wrapper Generation

Solution 1: Hard code for each source Solution 2: Automatic wrapper generation

WrapperWrapperGenerator

Definition

CS 336 8

Integration

Data Cleaning Data Loading Derived Data

Client Client

Warehouse

Source Source Source

Query & Analysis

Integration

Metadata

CS 336 9

Data Integration

Receive data (changes) from multiple wrappers/monitors and integrate into warehouse

Rule-based Actions

Resolve inconsistencies Eliminate duplicates Integrate into warehouse (may not be empty) Summarize data Fetch more data from sources (wh updates) etc.

CS 336 10

Data Cleaning

Find (& remove) duplicate tuples e.g., Jane Doe vs. Jane Q. Doe

Detect inconsistent, wrong data Attribute values that don’t match

Patch missing, unreadable data Insert default values

Notify sources of errors found

CS 336 11

Data Cleaning

Migration (e.g., yen to dollars) Scrubbing: use domain-specific knowledge (e.g., social

security numbers) Fusion (e.g., mail list, customer merging)

billing DB

service DB

customer1(Joe)

customer2(Joe)

merged_customer(Joe)

CS 336 12

Loading Data in the Warehouse

Incremental vs. refresh Off-line vs. on-line Frequency of loading

At night, 1x a week/month, continuously

Parallel/Partitioned load

CS 336 13

Warehouse Maintenance

Warehouse data materialized view Initial loading View maintenance

Derived Warehouse Data indexes aggregates materialized views

View maintenance

CS 336 14

Materialized Views

Define new warehouse relations using SQL expressions

sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4

product id name pricep1 bolt 10p2 nut 5

joinTb prodId name price storeId date amtp1 bolt 10 c1 1 12p2 nut 5 c1 1 11p1 bolt 10 c3 1 50p2 nut 5 c2 1 8p1 bolt 10 c1 2 44p1 bolt 10 c2 2 4

does not existat any source

CS 336 15

Differs from Conventional View Maintenance...

Warehouses may be highly aggregated and summarized

Warehouse views may be over history of base data

Process large batch updates Schema may evolve

CS 336 16

Differs from Conventional View Maintenance...

Base data doesn’t participate in view maintenance Simply reports changes Loosely coupled Absence of locking, global transactions May not be queriable

CS 336 17

Warehouse Maintenance Anomalies

Materialized view maintenance in loosely coupled, non-transactional environment

Simple example

Sales Comp.

Integrator

DataWarehouse

Sale(item,clerk) Emp(clerk,age)

Sold (item,clerk,age)

Sold = Sale Emp

CS 336 18

Warehouse Maintenance Anomalies

1. Insert into Emp(Mary,25), notify integrator

2. Insert into Sale (Computer,Mary), notify integrator

3. (1) integrator adds Sale (Mary,25)

4. (2) integrator adds (Computer,Mary) Emp

5. View incorrect (duplicate tuple)

Sales Comp.

Integrator

DataWarehouse

Sale(item,clerk) Emp(clerk,age)

Sold (item,clerk,age)

CS 336 19

Maintenance Anomaly - Solutions

Incremental update algorithms (ECA, Strobe, etc.)

Research issues: Self-maintainable views What views are self-maintainable Store auxiliary views so original + auxiliary

views are self-maintainable

CS 336 20

Self-Maintainability: Examples

Sold(item,clerk,age) =

Sale(item,clerk) Emp(clerk,age)

Inserts into EmpIf Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect

Inserts into SaleMaintain auxiliary view: Emp-clerk,age(Sold)

Deletes from EmpDelete from Sold based on clerk

CS 336 21

Self-Maintainability: Examples

Deletes from SaleDelete from Sold based on {item,clerk}

Unless age at time of sale is relevant

Auxiliary views for self-maintainability Must themselves be self-maintainable One solution: all source data But want minimal set

CS 336 22

Partial Self-Maintainability

Avoid (but don’t prohibit) going to sourcesSold=Sale(item,clerk) Emp(clerk,age)

Inserts into Sale Check if clerk already in Sold, go to source if

not Or replicate all clerks over age 30 Or ...

CS 336 23

Warehouse Specification (ideally)

Extractor/Monitor

Extractor/Monitor

Extractor/Monitor

Integrator

Warehouse

...

Metadata

Warehouse Configuration

Module

View Definitions

Integrationrules

ChangeDetection

Requirements

CS 336 24

Processing

ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms Client Client

Warehouse

Source Source Source

Query & Analysis

Integration

Metadata

CS 336 25

ROLAP Server

Relational OLAP Server

relationalDBMS

ROLAPserver

tools

utilities

sale prodId date sump1 1 62p2 1 19p1 2 48

Special indices, tuning;

Schema is “denormalized”

CS 336 26

MOLAP Server

Multi-Dimensional OLAP Server

multi-dimensional

server

M.D. tools

utilitiescould also

sit onrelational

DBMS

Pro

du

ctCity

Date1 2 3 4

milk

soda

eggs

soap

AB

Sales

CS 336 27

Index Structures (sketch)

Traditional Access Methods B-trees, hash tables, R-trees, grids, …

Popular in Warehouses inverted lists bit map indexes join indexes text indexes

CS 336 28

What to Materialize?

Store in warehouse results useful for common queries

Example:day 2

c1 c2 c3p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

c1 c2 c3p1 56 4 50p2 11 8

c1 c2 c3p1 67 12 50

c1p1 110p2 19

129

. . .

total sales

materialize

CS 336 29

Materialization Factors

Type/frequency of queries Query response time Storage cost Update cost

CS 336 30

Cube Aggregates Lattice

city, product, date

city, product city, date product, date

city product date

all

day 2c1 c2 c3

p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

c1 c2 c3p1 56 4 50p2 11 8

c1 c2 c3p1 67 12 50

129

use greedyalgorithm todecide whatto materialize

CS 336 31

Dimension Hierarchies

all

state

city

cities city statec1 CAc2 NY

CS 336 32

Dimension Hierarchies

city, product

city, product, date

city, date product, date

city product date

all

state, product, date

state, date

state, product

state

not all arcs shown...

CS 336 33

Interesting Hierarchy

all

years

quarters

months

days

weeks

time day week month quarter year1 1 1 1 20002 1 1 1 20003 1 1 1 20004 1 1 1 20005 1 1 1 20006 1 1 1 20007 1 1 1 20008 2 1 1 2000

conceptualdimension table

CS 336 34

Managing

Metadata Warehouse Design Tools

Client Client

Warehouse

Source Source Source

Query & Analysis

Integration

Metadata

CS 336 35

Metadata

Administrative definition of sources, tools, ... schemas, dimension hierarchies, … rules for extraction, cleaning, … refresh, purging policies user profiles, access control, ...

CS 336 36

Metadata

Business business terms & definition data ownership, charging

Operational data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails

CS 336 37

Design Summary

What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index?

CS 336 38

Tools

Development design & edit: schemas, views, scripts, rules, queries, reports

Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning

Warehouse Management performance monitoring, usage patterns, exception reporting

System & Network Management measure traffic (sources, warehouse, clients)

Workflow Management “reliable scripts” for cleaning & analyzing data

CS 336 39

Current State of Industry

Extraction and integration done off-line Usually in large, time-consuming, batches

Everything copied at warehouse Not selective about what is stored Query benefit vs storage & update cost

Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything

CS 336 40

State of Commercial Practice ... Connectivity to sources

Apertus Information Builders Informix Enterprise Gateway Oracle Open Connect CA-Ingres gateway MS ODBC Platinum InfoHub

Data extract, clean, transform, refresh CA-Ingres Replicator ETI-Extract IBM Data Joiner, Data

Propagator Prism Warehouse manager SAS Access Sybase Replication Server Trinzic InfoPump

CS 336 41

… State of Commercial Practice ... Multidimensional

Database Engines Arbor Essbase Oracle RIR Express Comshare Commader SAS System

Warehouse Data Servers CA-Ingres Oracle 8 RedBrick Sybase IQ Informix Dynamic Server IBM DB2

ROLAP Servers HP Intelligent Warehouse Informix Metacube MicroStrategy DSS Server Information Advantage Asxys

CS 336 42

… State of Commercial Practice Query/Reporting

Environments IBM DataGuide SAS Access CA Visual Express

Platinum Forest&Trees Informix ViewPoint

Multidimensional Analysis Kenan Systems Acumate Microsoft Excel Arbor Essbase Analysis server Cognos PowerPlay IQ Software IQ/Vision Lotus 123 SAS OLAP++ Business Objects

Lots and lots of consulting!!

CS 336 43

Future Directions

Better performance Larger warehouses Easier to use What are companies & research labs

working on?

CS 336 44

Research (1)

Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush)

CS 336 45

Research (2)

Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data Conceptual Modelling

CS 336 46

Conclusions

Massive amounts of data and complexity of queries will push limits of current warehouses

Need better systems: easier to use provide quality information