olap and data warehousing

46
OLAP and Data Warehousing Surajit Chaudhuri Microsoft Research, Redmond, WA, USA [email protected] Umeshwar Dayal Hewlett-Packard Labs., Palo Alto, CA, USA [email protected]

Upload: dalia

Post on 14-Jan-2016

61 views

Category:

Documents


1 download

DESCRIPTION

OLAP and Data Warehousing. Surajit Chaudhuri Microsoft Research, Redmond, WA, USA [email protected] Umeshwar Dayal Hewlett-Packard Labs., Palo Alto, CA, USA [email protected]. What is OLAP?. O n- L ine A nalytical P rocessing - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: OLAP and Data Warehousing

OLAP and Data Warehousing

Surajit ChaudhuriMicrosoft Research, Redmond, WA, USA

[email protected]

Umeshwar DayalHewlett-Packard Labs., Palo Alto, CA, USA

[email protected]

Page 2: OLAP and Data Warehousing

23© Surajit Chaudhuri, Umeshwar Dayal

What is OLAP? On-Line Analytical Processing Information technology to help the knowledge

worker (executive, manager, analyst) make faster and better decisions.

OLAP is an element of decision support systems (DSS).

Page 3: OLAP and Data Warehousing

3

Running Example: Car Sales Cars: carId, make, model, color

Dealers: dealerId, city, state

Time of Sale: tid, year, month, day

Sales: carId, dealerId, tid, price

Page 4: OLAP and Data Warehousing

4

OLTP Queries: Examples

create a new sales record that indicates that a red VW Golf was sold in Boston, MA

see how many black and silver VW Passats were sold at dealership #123 on April 11th 2005

Page 5: OLAP and Data Warehousing

5

OLAP Queries: Examples Analyze comparative sales of the different colors

of VW Golf by state

See which months are particularly favorable to the sale of different VW models and colors

Rank VW dealerships by revenue, displaying a ranked list of dealerships and % differences in sales between each dealership and the one ranked 1 place higher

Page 6: OLAP and Data Warehousing

6

OLAP vs. OLTP OLAP Knowledge worker Decision support Subject-oriented (Star, snowflake) Historical, Consolidated Summarized,

Multidimensional Ad hoc Complex query Read mostly Lots of scans Millions Hundreds 100 GB - TB Query throughput, response

OLTP Clerk, IT professional Day to day operations Application-oriented (E-R based) Current, Isolated Detailed, Flat relational Structured, Repetitive Short, simple transaction

Read/write Index/hash on prim. key Tens Thousands 100 MB - GB Trans. throughput

User Function DB design

Data View Usage Unit of work Access Operations # Records accessed # Users Db size Metric

© Surajit Chaudhuri, Umeshwar Dayal

Page 7: OLAP and Data Warehousing

7

OLAP Queries: Challenges Many AND, OR in the WHERE clause Self-join, nested sub-queries

» Last year’s sales vs this year’s sales for each product» Show reps for whom every sale has been more than $15000

Extensive use of aggregation, often on related datasets Aggregation over time periods Ranking Use of statistical functions Very large datasets Expectation of an interactive response time

Page 8: OLAP and Data Warehousing

8© Surajit Chaudhuri, Umeshwar Dayal

OLAP Query Tools Goal of OLAP is to support ad-hoc querying for

the business analyst (Power user) Business analysts are familiar with spreadsheets Extend spreadsheet analysis model to work with

warehouse data» Large data set» Semantically enriched to understand business terms

(e.g., time, geography)» Combined with reporting features

Multidimensional view of data is the foundation of OLAP.

Page 9: OLAP and Data Warehousing

9© Surajit Chaudhuri, Umeshwar Dayal

Multidimensional Data Model Database is a set of facts (points) in a

multidimensional space A fact has a measure dimension

» quantity that is analyzed, e.g., sale amount, budget A set of dimensions with respect to which data is

analyzed» e.g., store, product, date associated with a sale amount

Dimensions form a sparsely populated coordinate system

Each dimension has a set of attributes» e.g., owner, city and county of store

Page 10: OLAP and Data Warehousing

10

Attribute Hierarchies Attributes of a dimension may be related An m:1 dependency is most common Dependency graph may be:

» Hierarchy: e.g., city -> state -> country

» Lattice: date -> month -> yeardate -> week -> year

Hierarchies are most common Dependencies influence choice of operations and

data representation

© Surajit Chaudhuri, Umeshwar Dayal

Page 11: OLAP and Data Warehousing

11© Surajit Chaudhuri, Umeshwar Dayal

State

State

Col

orC

olor

DateDate

Fact data: Sales volume in $100Sales volume in $100

RedGreenBlue

WhiteSilver Black

1 2 3 4 76 5

WI CA

15

10

12

20

50

10

DimensionsColor, State, Date

Attributesdate (year, month, day)

Attribute Hierarchies and Lattice Industry Country Year

Category State Quarter

Product City Month Week

NY

Date

Multidimensional Data Sales volume as a function of product, time, geography

Page 12: OLAP and Data Warehousing

12© Surajit Chaudhuri, Umeshwar Dayal

ROLAP and MOLAP Relational OLAP (ROLAP)

» Relational and Specialized Relational DBMS to store and manage warehouse data

» OLAP middleware to support missing pieces– Optimize for each DBMS backend – Aggregation Navigation Logic– Additional tools and services

Multidimensional OLAP (MOLAP) » Array-based storage structures» Direct access to array data structures

Page 13: OLAP and Data Warehousing

13

Multiple Aggregations Create a 2-dimensional spreadsheet that shows

sum of sales by year as well as by model of car Each subtotal requires a separate aggregate query

YEAR

STATE

Sum By State

Sum by

Year

© Surajit Chaudhuri, Umeshwar Dayal

Page 14: OLAP and Data Warehousing

14

Example: Multiple Aggregations

WI CA Total

2003 63 81 144

2004 38 107 145

2005 75 35 110

Total 176 223 399

Page 15: OLAP and Data Warehousing

15

Generalization: The Data Cube Base tuples Aggregate tuples:

» one aggregation for each subset of dimensions (powerset)

» exponential number of subsets, but can optimize the computation

Example» N = 3 dimensions

– model = {Golf, Jetta}– color = {red, black, white}– state = {NY, CA, WI}

» How many aggregate tuples in the data cube?– face – 1D agg; edge – 2D agg; corner – 3D agg

Page 16: OLAP and Data Warehousing

16

Operations on Multidimensional Data Model

Aggregation (roll-up) of detailed data to create summary data

Navigation to detailed data (drill-down) from summary Selection (slice) defines a subcube

– Project the cube on fewer dimensions by specifying coordinates of remaining dimensions

– e.g., sales where state = NY and month = Jan

Calculation– Within a dimension, e.g., (sales - expense) by state– Across dimensions

Ranking – top 3% of states by average sales

Window Queries

© Surajit Chaudhuri, Umeshwar Dayal

Page 17: OLAP and Data Warehousing

17

Roll-up and Drill-Down Roll-Up: Use of aggregation

» dimension reduction: – e.g., total sales by state by color– e.g., total sales by state

» navigating attribute hierarchy: – e.g., sales by city -> total sales by state -> total sales by

country– e.g., total sales by city and year -> total sales by state and year

-> total sales by country

Drill-Down: Inverse operation of roll-up» Provides the data set that was aggregated

– e.g., show “base” data for total sales figure for CA state

© Surajit Chaudhuri, Umeshwar Dayal

Page 18: OLAP and Data Warehousing

18

Slice and Dice What colors of Golf are not doing so well?

Select color, sum(price) From SALES Where model = ‘Golf’ slicing Group By color dicing

Keep slicing if results are uniform

Page 19: OLAP and Data Warehousing

19

More ExamplesQ: Given a query, which values from the CUBE do

we need to retrieve?

A: To answer a query Q use tuples T s.t.» If Q groups by A, T must have a non-* value in its

component for A» If Q slices by A = b, T must have the value b (not * or

any other value) in its component for A» If Q neither groups nor slices by A, then T has to have

* in its component for A

Page 20: OLAP and Data Warehousing

20© Surajit Chaudhuri, Umeshwar Dayal

Cit

yC

ity

ProductProductM

onth

Mon

th

Pro

duct

Pro

duct

MonthMonthFact data: Sales volume in $100Sales volume in $100

JuiceColaMilk

CreamToothpaste

Soap

1 2 3 4 76 5

LASF

NY

15

10

12

20

50

10

CitCit

yyPivot (Rotate)

Result: cross tabulation

Page 21: OLAP and Data Warehousing

21© Surajit Chaudhuri, Umeshwar Dayal

Warehouse Database Schema Entity-Relationship design techniques not

appropriate Design should reflect multidimensional view Typical schemas:

» Star Schema» Snowflake Schema» Fact Constellation Schema

Page 22: OLAP and Data Warehousing

22© Surajit Chaudhuri, Umeshwar Dayal

Example of a Star Schema

Fact table

Order OrderNoOrderDate

Customer

CustomerNoCustomerNameCustomerAddressCity

SalespersonSalespersonIDSalespersonNameCityQuota

ProdNoProdNameProdDescrCategoryCategoryDescrUnitPriceQOH

City

CityNameStateCountry

Date

DateKeyDateMonthYear

OrderNoSalespersonIDCustomerNoProdNoDateKeyCityNameQuantityTotalPrice

Product

Page 23: OLAP and Data Warehousing

23© Surajit Chaudhuri, Umeshwar Dayal

Star Schema and Variants A single fact table and a single table for each

dimension Generated keys are used for performance and

maintenance reasons Fact constellation: Multiple Fact tables that share

common dimension tables» Example: ProjectedExpense and ActualExpense may

share dimensional tables Snowflake Schema: Represents dimensional

hierarchy by normalization

Page 24: OLAP and Data Warehousing

24© Surajit Chaudhuri, Umeshwar Dayal

Example of a Snowflake Schema

Fact table

OrderNoSalespersonIDCustomerNoDateKeyCityNameProdNoQuantityTotalPrice

Order OrderNoOrderDate

Customer

CustomerNoCustomerNameCustomerAddressCity

SalespersonSalespersonIDSalespesonNameCityQuota

ProductProdNoProdNameProdDescrCategoryUnitPriceQOH

CategoryCategoryNameCategoryDescr

City

CityNameState

State

StateNameCountry

Date

DateKeyDateMonth

Month

MonthYear

Year

Year

Page 25: OLAP and Data Warehousing

25

Performance Considerations Normalization for dimension tables

» Read-only data, so no update anomalies» Fewer joins – better performance

Pre-computation of summary tables» Re-use can speed up performance» How can we use pre-computed results effectively?

Data is very large, dimension data often sparse» Crucial to use indexes effectively » Need for new indexing techniques: bitmap indexes, join

indexes

Page 26: OLAP and Data Warehousing

26© Surajit Chaudhuri, Umeshwar Dayal

Bit Map Index An alternative representation of RID-list Comparison, join and aggregation operations are

reduced to bit arithmetic Specially advantageous for low-cardinality

domains» Significant reduction in space and I/O (30:1)» Adapted for higher cardinality domains » Compression (e.g., run-length encoding) exploited » Upper Bound of 2R words for any bitmap over R rows

[Hasan & Sinha, 1997]

Page 27: OLAP and Data Warehousing

27

Bitmap Index Example

custid name gender rating

112 Joe M 3

115 Ram M 5

119 Sue F 5

116 Woo M 4

M F

1 0

1 0

0 1

1 0

1 2 3 4 5

0 0 1 0 0

0 0 0 0 1

0 0 0 0 1

0 0 0 1 0

Page 28: OLAP and Data Warehousing

28© Surajit Chaudhuri, Umeshwar Dayal

Join Index Traditional index maps the value in a column to a

list of rows with that value Join index maintain relationships between

attribute value of a dimension and the matching rows in the fact table

Join index may span multiple dimensions (composite join index)» Use join index to identify regions of cartesian product

that are of interest» Few people in Southern California may buy umbrellas

Page 29: OLAP and Data Warehousing

29© Surajit Chaudhuri, Umeshwar Dayal

Algorithm Using Bitmapped Join Indexes

[O’Neil&Graefe95] Maintain bit mapped join indexes between each

dimension table and the fact table To answer a query over multiple dimensions

» Take intersection of join indexes until the set of candidate fact tuples is small

» Do foreign key joins with rest of the dimension tables» Look up the fact table

Page 30: OLAP and Data Warehousing

30© Surajit Chaudhuri, Umeshwar Dayal

Join Index over Star Schema

Fact table

Order OrderNoOrderDate

Customer

CustomerNoCustomerNameCustomerAddressCity

SalespersonSalespersonIDSalespesonNameCityQuota

ProdNoProdNameProdDescrCategoryCategoryDescrUnitPriceQOH

City

CityNameStateCountry

DateDateKeyDateMonthYear

OrderNoSalespersonIDCustomerNoProdNoDateKeyCityNameQuantityTotalPrice

Product

Page 31: OLAP and Data Warehousing

31© Surajit Chaudhuri, Umeshwar Dayal

ROLAP:Handling of Aggregate Views

Important component for ROLAP Servers Choice of aggregate views to materialize Physical representation of Materialized Views in

the star schema Logic for Aggregation Navigation

» make optimum use of materialized aggregates to answer a query

Page 32: OLAP and Data Warehousing

32© Surajit Chaudhuri, Umeshwar Dayal

ROLAP: Choice of Aggregate Views to Materialize

Storage can increase dramatically if precomputed views are not chosen properly

Must take into account queries in the workload, their frequencies and their costs

The decision must be taken in the broader context of physical database design» e.g., should take into account the choice of indexes

Heuristic approaches adopted in products

Page 33: OLAP and Data Warehousing

33© Surajit Chaudhuri, Umeshwar Dayal

ROLAP: Using Materialized Views Through Selection

A query can use a view through a selection if » Each selection condition C on each dimension d

in the query is » Logically implies a condition C’ on dimension

d in the view Example: A view has sum(sales) by product and

by year for products introduced after 1991» OK to use for sum(sales) by product for

products introduced after 1992 » CANNOT use for sum(sales) for products

introduced after 1989

Page 34: OLAP and Data Warehousing

34© Surajit Chaudhuri, Umeshwar Dayal

Using Materialized Views through Group By (Roll Up)

The view V may be applicable via roll-up if for every grouping attribute g of the query Q:» Q has Group By a1,..,g, an» V has Group By a1,..,h, an» Attribute g is higher than h in the attribute

hierarchy » Aggregation functions are distributive

Example: Compute “sum(sales) by category” from the view “sum(sales) by product”

Page 35: OLAP and Data Warehousing

35© Surajit Chaudhuri, Umeshwar Dayal

-- W.H. Inmon, Building the Data Warehouse, 1992.

Data Warehouse A decision support database that is maintained

separately from the organization’s operational databases.

A data warehouse is a – subject-oriented, – integrated,– time-varying, – non-volatile

collection of data that is used primarily in organizational decision making.

Page 36: OLAP and Data Warehousing

36© Surajit Chaudhuri, Umeshwar Dayal

Why Separate Data Warehouse Performance

» Op dbs designed & tuned for known trans. workloads.» Complex OLAP queries would degrade performance

for operational transactions. » Special data organization, access & implementation

methods needed for multidimensional views & queries. Function

» Missing data: Decision support requires historical data, which op dbs do not typically maintain.

» Data consolidation: Decision support requires data consolidation (aggregation, summarization) from many heterogeneous sources: op dbs, external sources.

» Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

Page 37: OLAP and Data Warehousing

37© Surajit Chaudhuri, Umeshwar Dayal

Data Warehousing Architecture

Data sources

External sources Extract

TransformTransport

Data Warehouse

Data Marts

OLAPOLAP Servers

Data Mining

Query/Reporting

Metadata Repository

Monitoring & Administration

Front-End Tools

ServeOperational dbs

Page 38: OLAP and Data Warehousing

38© Surajit Chaudhuri, Umeshwar Dayal

Data Warehouse vs. Data Marts Enterprise data warehouse: collects all information

about subjects (customers, products, sales, assets, personnel) that span the entire organization. » Requires extensive business modeling. » May take years to design and build.

Data Marts: Departmental subsets that focus on selected subjects. » Marketing data mart: customer, products, sales. » Faster roll out, but complex integration in the long run.

Virtual warehouse: views over operational dbs» materialize some summary views for efficient query

processing » easier to build» requisite excess capacity on operational db servers.

Page 39: OLAP and Data Warehousing

39© Surajit Chaudhuri, Umeshwar Dayal

Three-Tier Architecture Warehouse database server

» almost always a relational DBMS; rarely flat files. OLAP servers

» Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations.

» Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.

Clients» Query and reporting tools.» Analysis tools.» Data mining tools.

Page 40: OLAP and Data Warehousing

40© Surajit Chaudhuri, Umeshwar Dayal

Populating & Refreshing the Warehouse

Data extraction Data cleaning Data transformation

» Convert from legacy/host format to warehouse format Load

» Sort, summarize, consolidate, compute views, check integrity, build indexes, partition

Refresh» Propagate updates from sources to the warehouse.

Page 41: OLAP and Data Warehousing

41© Surajit Chaudhuri, Umeshwar Dayal

Data Cleaning Why?

» Data warehouse contains data that is analyzed for business decisions

» More data and mulitple sources could mean more errors » Results in incorrect analysis

Detecting data anomalies and rectifying them early has huge payoffs

Important to identify tools that work together well Long Term Solution

» Change business practices and data entry tools» Repository for metadata

Page 42: OLAP and Data Warehousing

42© Surajit Chaudhuri, Umeshwar Dayal

Load Issues:

» huge volumes of data to be loaded» small time window (usually at night) when the

warehouse can be taken off-line» when to build indexes and summary tables» allow system administrator to monitor status, cancel

suspend, resume load, or change load rate» restart after failure with no loss of data integrity.

Techniques:» batch load utility: sort input records on clustering key

and use sequential I/O; build indexes and derived tables» sequential loads still too long (~100 days for TB)» use parallelism and incremental techniques.

Page 43: OLAP and Data Warehousing

43© Surajit Chaudhuri, Umeshwar Dayal

Parallel Load

Pipelined and partitioned parallelism

[Barclay, Barnes, Gray, Sundaresan: Loading Databases Using Dataflow Parallelism]

Source tables Sort runs Merge runs Table insert Target tables

Build index record

Scan

Sort runs Merge runs Index insert Target index

Page 44: OLAP and Data Warehousing

44© Surajit Chaudhuri, Umeshwar Dayal

Incremental Load Full load may still take too long.

» entire load is a (long) batch transaction» replace old table with new after transaction commits» use periodic checkpoints; after failure, restart from last

checkpoint. Use incremental loads during refresh to reduce data

volume » insert only updated tuples» now, incremental load conflicts with queries» break into sequence of shorter transactions (every

~1000 records, every few seconds)» coordinate this sequence of transactions: must ensure

consistency between base tables and derived tables & indices.

Page 45: OLAP and Data Warehousing

45© Surajit Chaudhuri, Umeshwar Dayal

Refresh Issues:

» when to refresh– on every update: too expensive, only necessary if

OLAP queries need current data (e.g., up-to-the-minute stock quotes)

– periodically (e.g., every 24 hours, every week) or after “significant” events

– refresh policy set by administrator based on user needs and traffic

– possibly different policies for different sources.» how to refresh.

Page 46: OLAP and Data Warehousing

46© Surajit Chaudhuri, Umeshwar Dayal

Refresh Techniques Full extract from base tables

» read entire source table or database: expensive» may be the only choice for legacy databases or files.

Incremental techniques (related to work on active dbs)» detect & propagate changes on base tables: replication

servers – snapshots & triggers (Oracle)– transaction shipping (Sybase)

» logical correctness– computing changes to star tables – computing changes to derived and summary tables– optimization: only significant changes

» transactional correctness: incremental load.