data warehouse design enrico franconi cs 636. cs 3362 implementing a warehouse monitoring: sending...
TRANSCRIPT
CS 336 2
Implementing a Warehouse
Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing, ... Managing: Metadata, Design, ...
CS 336 3
Monitoring Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … How to get data out?
Replication tool Dump file Create report ODBC or third-party “wrappers”
CS 336 4
Monitoring Techniques
Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring
CS 336 5
Monitoring Issues
Frequency periodic: daily, weekly, … triggered: on “big” change, lots of changes, ...
Data transformation convert data to uniform format remove & add fields (e.g., add date to get history)
Standards (e.g., ODBC)
Gateways
CS 336 6
Wrapper
Converts data and queries from one data model to another
Extends query capabilities for sources with limited capabilities
DataModel
B
DataModel
A
Queries
Data
Queries SourceWrapper
CS 336 7
Wrapper Generation
Solution 1: Hard code for each source Solution 2: Automatic wrapper generation
WrapperWrapperGenerator
Definition
CS 336 8
Integration
Data Cleaning Data Loading Derived Data
Client Client
Warehouse
Source Source Source
Query & Analysis
Integration
Metadata
CS 336 9
Data Integration
Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
Rule-based Actions
Resolve inconsistencies Eliminate duplicates Integrate into warehouse (may not be empty) Summarize data Fetch more data from sources (wh updates) etc.
CS 336 10
Data Cleaning
Find (& remove) duplicate tuples e.g., Jane Doe vs. Jane Q. Doe
Detect inconsistent, wrong data Attribute values that don’t match
Patch missing, unreadable data Insert default values
Notify sources of errors found
CS 336 11
Data Cleaning
Migration (e.g., yen to dollars) Scrubbing: use domain-specific knowledge (e.g., social
security numbers) Fusion (e.g., mail list, customer merging)
billing DB
service DB
customer1(Joe)
customer2(Joe)
merged_customer(Joe)
CS 336 12
Loading Data in the Warehouse
Incremental vs. refresh Off-line vs. on-line Frequency of loading
At night, 1x a week/month, continuously
Parallel/Partitioned load
CS 336 13
Warehouse Maintenance
Warehouse data materialized view Initial loading View maintenance
Derived Warehouse Data indexes aggregates materialized views
View maintenance
CS 336 14
Materialized Views
Define new warehouse relations using SQL expressions
sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4
product id name pricep1 bolt 10p2 nut 5
joinTb prodId name price storeId date amtp1 bolt 10 c1 1 12p2 nut 5 c1 1 11p1 bolt 10 c3 1 50p2 nut 5 c2 1 8p1 bolt 10 c1 2 44p1 bolt 10 c2 2 4
does not existat any source
CS 336 15
Differs from Conventional View Maintenance...
Warehouses may be highly aggregated and summarized
Warehouse views may be over history of base data
Process large batch updates Schema may evolve
CS 336 16
Differs from Conventional View Maintenance...
Base data doesn’t participate in view maintenance Simply reports changes Loosely coupled Absence of locking, global transactions May not be queriable
CS 336 17
Warehouse Maintenance Anomalies
Materialized view maintenance in loosely coupled, non-transactional environment
Simple example
Sales Comp.
Integrator
DataWarehouse
Sale(item,clerk) Emp(clerk,age)
Sold (item,clerk,age)
Sold = Sale Emp
CS 336 18
Warehouse Maintenance Anomalies
1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale (Computer,Mary), notify integrator
3. (1) integrator adds Sale (Mary,25)
4. (2) integrator adds (Computer,Mary) Emp
5. View incorrect (duplicate tuple)
Sales Comp.
Integrator
DataWarehouse
Sale(item,clerk) Emp(clerk,age)
Sold (item,clerk,age)
CS 336 19
Maintenance Anomaly - Solutions
Incremental update algorithms (ECA, Strobe, etc.)
Research issues: Self-maintainable views What views are self-maintainable Store auxiliary views so original + auxiliary
views are self-maintainable
CS 336 20
Self-Maintainability: Examples
Sold(item,clerk,age) =
Sale(item,clerk) Emp(clerk,age)
Inserts into EmpIf Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect
Inserts into SaleMaintain auxiliary view: Emp-clerk,age(Sold)
Deletes from EmpDelete from Sold based on clerk
CS 336 21
Self-Maintainability: Examples
Deletes from SaleDelete from Sold based on {item,clerk}
Unless age at time of sale is relevant
Auxiliary views for self-maintainability Must themselves be self-maintainable One solution: all source data But want minimal set
CS 336 22
Partial Self-Maintainability
Avoid (but don’t prohibit) going to sourcesSold=Sale(item,clerk) Emp(clerk,age)
Inserts into Sale Check if clerk already in Sold, go to source if
not Or replicate all clerks over age 30 Or ...
CS 336 23
Warehouse Specification (ideally)
Extractor/Monitor
Extractor/Monitor
Extractor/Monitor
Integrator
Warehouse
...
Metadata
Warehouse Configuration
Module
View Definitions
Integrationrules
ChangeDetection
Requirements
CS 336 24
Processing
ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms Client Client
Warehouse
Source Source Source
Query & Analysis
Integration
Metadata
CS 336 25
ROLAP Server
Relational OLAP Server
relationalDBMS
ROLAPserver
tools
utilities
sale prodId date sump1 1 62p2 1 19p1 2 48
Special indices, tuning;
Schema is “denormalized”
CS 336 26
MOLAP Server
Multi-Dimensional OLAP Server
multi-dimensional
server
M.D. tools
utilitiescould also
sit onrelational
DBMS
Pro
du
ctCity
Date1 2 3 4
milk
soda
eggs
soap
AB
Sales
CS 336 27
Index Structures (sketch)
Traditional Access Methods B-trees, hash tables, R-trees, grids, …
Popular in Warehouses inverted lists bit map indexes join indexes text indexes
CS 336 28
What to Materialize?
Store in warehouse results useful for common queries
Example:day 2
c1 c2 c3p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
c1 c2 c3p1 56 4 50p2 11 8
c1 c2 c3p1 67 12 50
c1p1 110p2 19
129
. . .
total sales
materialize
CS 336 29
Materialization Factors
Type/frequency of queries Query response time Storage cost Update cost
CS 336 30
Cube Aggregates Lattice
city, product, date
city, product city, date product, date
city product date
all
day 2c1 c2 c3
p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
c1 c2 c3p1 56 4 50p2 11 8
c1 c2 c3p1 67 12 50
129
use greedyalgorithm todecide whatto materialize
CS 336 32
Dimension Hierarchies
city, product
city, product, date
city, date product, date
city product date
all
state, product, date
state, date
state, product
state
not all arcs shown...
CS 336 33
Interesting Hierarchy
all
years
quarters
months
days
weeks
time day week month quarter year1 1 1 1 20002 1 1 1 20003 1 1 1 20004 1 1 1 20005 1 1 1 20006 1 1 1 20007 1 1 1 20008 2 1 1 2000
conceptualdimension table
CS 336 34
Managing
Metadata Warehouse Design Tools
Client Client
Warehouse
Source Source Source
Query & Analysis
Integration
Metadata
CS 336 35
Metadata
Administrative definition of sources, tools, ... schemas, dimension hierarchies, … rules for extraction, cleaning, … refresh, purging policies user profiles, access control, ...
CS 336 36
Metadata
Business business terms & definition data ownership, charging
Operational data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails
CS 336 37
Design Summary
What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index?
CS 336 38
Tools
Development design & edit: schemas, views, scripts, rules, queries, reports
Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning
Warehouse Management performance monitoring, usage patterns, exception reporting
System & Network Management measure traffic (sources, warehouse, clients)
Workflow Management “reliable scripts” for cleaning & analyzing data
CS 336 39
Current State of Industry
Extraction and integration done off-line Usually in large, time-consuming, batches
Everything copied at warehouse Not selective about what is stored Query benefit vs storage & update cost
Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything
CS 336 40
State of Commercial Practice ... Connectivity to sources
Apertus Information Builders Informix Enterprise Gateway Oracle Open Connect CA-Ingres gateway MS ODBC Platinum InfoHub
Data extract, clean, transform, refresh CA-Ingres Replicator ETI-Extract IBM Data Joiner, Data
Propagator Prism Warehouse manager SAS Access Sybase Replication Server Trinzic InfoPump
CS 336 41
… State of Commercial Practice ... Multidimensional
Database Engines Arbor Essbase Oracle RIR Express Comshare Commader SAS System
Warehouse Data Servers CA-Ingres Oracle 8 RedBrick Sybase IQ Informix Dynamic Server IBM DB2
ROLAP Servers HP Intelligent Warehouse Informix Metacube MicroStrategy DSS Server Information Advantage Asxys
CS 336 42
… State of Commercial Practice Query/Reporting
Environments IBM DataGuide SAS Access CA Visual Express
Platinum Forest&Trees Informix ViewPoint
Multidimensional Analysis Kenan Systems Acumate Microsoft Excel Arbor Essbase Analysis server Cognos PowerPlay IQ Software IQ/Vision Lotus 123 SAS OLAP++ Business Objects
Lots and lots of consulting!!
CS 336 43
Future Directions
Better performance Larger warehouses Easier to use What are companies & research labs
working on?
CS 336 44
Research (1)
Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush)
CS 336 45
Research (2)
Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data Conceptual Modelling