olap cs 543 – data warehousing. cs 543 - data warehousing (sp 2007-2008) - asim karim @ lums2...
Post on 21-Dec-2015
224 views
TRANSCRIPT
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 2
Where Does OLAP Fit In? (1)
OLAP = On-line analytical processing. OLAP is a characterization of applications, not a database
design technique. Idea is to provide very fast response time in order to
facilitate iterative decision-making. Analytical processing requires access to complex
aggregations (as opposed to record-level access).
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 3
Where Does OLAP Fit In? (2)
Information is conceptually viewed as “cubes” for simplifying the way in which users access, view, and analyze data.
Quantitative values are known as “facts” or “measures.”e.g., sales $, units sold, etc.
Descriptive categories are known as “dimensions.” e.g., geography, time, product, scenario (budget or actual),
etc. Dimensions are often organized in hierarchies that represent levels
of detail in the data (e.g., UPC, SKU, product subcategory, product category, etc.).
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 4
OLAP FASMI TestFast: Delivers information to the user at a fairly constant rate.
Most queries should be delivered to the user in five seconds or less.
Analysis: Performs basic numerical and statistical analysis of the data, pre-defined by an application developer or defined ad hoc by the user.
Shared: Implements the security requirements necessary for sharing potentially confidential data across a large user population.
Multi-dimensional: The essential characteristic of OLAP.Information: Accesses all the data and information necessary
and relevant for the application, wherever it may reside and not limited by volume.
...from the OLAP Report by Pendse and Creeth.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 5
Need for Multidimensional Analysis
A simple analysis How many units of product A did we sell in the store in
DHA, Lahore
Typically, decision support requires more complex analyses
How much revenue did the new product X generate during the last three months, broken down by individual months, in the Southern Region, by individual stores, broken down by the promotions, compared to estimates, and compared to the previous version of the the product?
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 6
Kinds of Analyses
Roll-ups to provide summaries and aggregates along the hierarchies of the dimensions
Drill-downs from the top level to the lowest along the hierarchies of the dimensions
Calculations involving facts and metrics Algebraic equations involving key performance
indicators Moving averages and growth percentages Trend analyses using statistical methods
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 9
OLAP?
The name On-Line Analytical Processing was coined in a paper by E.F. Codd in 1993 (“Providing On-Line Analytical Processing for User Analysts”)
A definition OLAP is a category of software technology that enables
analysts, managers, and executives to gain insight into data through fast, consistent, interactive access in a a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 13
Some Queries
Display the total sales of all products for past five years in all stores
Compare total sales for all stores, product by product, between years 2000 and 1999.
Show comparison of sales by individual stores, product by product, between years 2000 and 1999 only for those products with reduced sales.
Show the results of the previous queries, but rotating the columns with rows
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 14
Hypercubes
Multi-dimension cubes Hard to visualize and display beyond three dimensions
Multi-dimensional domain structure (MDS) Represents each dimension as a line showing the values
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 21
OLAP Models/Implementations
MOLAP: OLAP implemented with a multi-dimensional database.
ROLAP: OLAP implemented with a relational database.
HOLAP: OLAP implemented with a hybrid of multi-dimensional and relational database technologies.
DOLAP: OLAP implemented for desktop decision support environments.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 23
MOLAP Implementations
OLAP has historically been implemented through use of multi-dimensional databases (MDDs).
Dimensions are key business factors for analysis: geographies (zip, state, region,...) products (item, product category, product department,...) dates (day, week, month, quarter, year,...)
Very high performance via fast look-up into “cube” data structure to retrieve pre-calculated results.
“Cube” data structures allow pre-calculation of aggregate results for each possible combination of dimensional values.
Use of application programming interface (API) for access via front-end tools.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 25
MOLAP Implementations
Need to consider both maintenance and storage implications when designing strategy for when to build cubes.
Maintenance Considerations: Every data item received into MDD must be aggregated into every cube (assuming “to-date” summaries are maintained).
Storage Considerations: Although cubes get much smaller (e.g., more dense) as dimensions get less detailed (e.g., year vs. day), storage implications for building hundreds of cubes can be significant.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 26
MOLAP Implementations
Typically outperform relational database technology because all answers are pre-computed into cubes (and overhead for accessing cubes is very low).
Difficult to scale because of combinatorial explosion in the number and size of cubes when dimensions of significant cardinality are required.
Beyond tens (sometimes small hundreds) of thousands of entries in a single dimension will break the MOLAP model because the pre-computed cube model does not work well when the cubes are very sparse in the population of individual cells.
See www.olapreport.com/DataExplosion.htm
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 27
Virtual CubesVirtual cubes are used when there is a need to join
information from two dissimilar cubes that share one or more common dimensions.
Similar to a relational view; two (or more) cubes are linked along common dimension(s).
Often used to save space by eliminating redundant storage of information.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 28
Partitioned Cubes
One logical cube of data can be spread across multiple physical cubes on separate (or same) servers.
The divide-and-conquer approach of partitioned cubes helps to mitigate the scalability limitations of a MOLAP environment.
Ideal cube partitioning is completely invisible to end users.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 29
ROLAP Implementations
Advances in database technologies and front-end tools have begun to allow deployment of OLAP using ANSI SQL RDBMS implementations.
ROLAP facilitates deployment of much larger dimension tables than MOLAP implementations.
Front-end tools to facilitate GUI access to multi-dimensional analysis capabilities.
Aggregate awareness allows exploitation of pre-built summary tables for some front-end tools.
Star schema designs are often used to facilitate OLAP against relational databases.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 31
Simplified Third Normal Form (Retail)
ZONE REGION
zip _x_SMSA1
ZIP ZONE ZIP SMSA ZIP ADI QTR YR
STORE # ADDRESS ZIP ...WEEK QTR
DATE WEEK
RECEIPT # STORE # DATE ...
DATE WEATHER
RECEIPT #ITEM # ... $
ITEM # CATEGORYITEM # MFCTR
DEPTCATEGORY
zip _x_adi year
quarter
week
sale_detail
item_x_categoryitem_x_mfctr
category_x_dept
M
1M
1 1 1M
1M11
M M
1 M 1
M
M M1 1
M1
STORE #1
M
M M
ZONE REGION
zip _x_SMSA1
ZIP ZONE ZIP SMSA ZIP ADI QTR YR
STORE # ADDRESS ZIP ...WEEK QTR
DATE WEEK
RECEIPT # STORE # DATE ...
DATE WEATHER
RECEIPT #ITEM # ... $
ITEM # CATEGORYITEM # MFCTR
DEPTCATEGORY
zip _x_adi year
quarter
week
sale_detail
item_x_categoryitem_x_mfctr
category_x_dept
M
1M
1 1 1M
1M11
M M
1 M 1
M
M M1 1
M1
STORE #1
M
M M
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 32
Simplified Star Schema
ITEM# CATEGORY DEPT MFCTR ...
ITEM# STORE# DATERECEIPT# ...
M
1
Fact Table
Product Dimension Table
STORE# ADDRESS ZIP ADI SMSA ZONE
1
M
Geography Dimension Table
REGION
$
DATE WEEK QUARTER YEAR ...
Calendar Dimension Table
1
M
A vastly simplified model ... may even summarize out receipt # .....
STORE# DATE WEATHER
1
M
1
M
ITEM# CATEGORY DEPT MFCTR ...
ITEM# STORE# DATERECEIPT# ...
M
1
Fact Table
Product Dimension Table
STORE# ADDRESS ZIP ADI SMSA ZONE
1
M
Geography Dimension Table
REGION
$
DATE WEEK QUARTER YEAR ...
Calendar Dimension Table
1
M
A vastly simplified model ... may even summarize out receipt # .....
STORE# DATE WEATHER
1
M
1
M
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 33
Simplified Star Schema
A vastly simplified physical data model!
Collapse dimensional hierarchies into a single table for each dimension and create a single fact table from the header and detail records:
Fewer tables. Fewer joins to get results.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 34
Star Schema for High Performance
Business question: How many $ in raincoats did I sell in the first week of January through stores in Boston?
Assume: 4 Billion rows in fact table. 20 different kinds (size, color, style) of raincoats
(product category) out of 50,000 UPCs in store. 8 stores out of 400 are in BOSTON SMSA. 2 years of POS history in DBMS.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 35
Star Schema for High PerformanceSimple (poor performance) approach to query
execution:
1. Join item table with filtering on raincoat product category (very selective) to fact table.
2. Join date table with filtering by week (next most selective) to result table.
3. Join store table with filtering on store to result table from step 2.
4. Aggregate.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 36
Star Schema for High Performance
Advanced (better performance) approach to query execution:
1. Cartesian product join between dimensional tables.* Result is 20 x 8 x 7 = 1,120 rows.
2. Use composite index on item:store:day into fact table for very selective access.* Access less than 0.00000008 percent of data in fact table!
Sophisticated cost-based optimizers will figure this out.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 37
Forcing a Cartesian Product Join
Add an addition “join_value” column in each dimensional table.
Set join_value to same value in all rows of the dimensional tables.
Add additional where clause predicates joining on this column between dimensional tables.
NOTE: This shouldn't be necessary with a “smart” optimizer.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 38
Forcing a Cartesian Product Join
Sample code:
select sum(sales.sales_amt)from d_sales_detail ,store ,item ,periodwhere d_sales_detail.store_id = store.store_id and d_sales_detail.item_id = item.item_id and d_sales_detail.day_dt = period.day_dt and period.day_dt between '23-NOV-2000' and '24-DEC-2000' and item.trade_style_cd = 'BARBIE' and store.state_cd = 'CA' and store.join_value = period.join_value and store.join_value = item.join_value and period.join_value = item.join_value;
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 39
Star Schema for High PerformanceProblem: What if I want to know raincoat sales in first
week of January regardless of store?
Answer: Performance advantage of composite index in traditional RDBMS is severely impaired!
B-tree indexing techniques do not allow for flexibility in the use of dimensions for query purposes.
Bit indexing (and variations thereof) often allows much more generality in achieving high performance from a star schema.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 40
Star Schema for High Performance
Bottom Line:
It is not at all unusual to obtain an order of magnitude (or more) in performance advantage using a star schema with advanced indexing versus a more traditional relational database implementation.
Despite what vendors may tell you, star schemas cannot be effectively implemented for all DSS business applications and/or data models.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 41
ROLAP
Relational OLAP often makes heavy use of summary tables to provide near instantaneous access for multi-dimensional queries.
Foundation is usually star schema or snowflake database design.
Allows OLAP with much larger data sets than multi-dimensional database (MDD) products using cube structures (MOLAP).
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 42
ROLAP
Number of summary tables can get very large if discipline is not enforced...
Assume a retail database with the following two dimensions on the fact table...
Calendar: Day, Week, Period, Quarter, Year, All Days
Geography: Store, Zone, District, Region, All Stores
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 43
ROLAP
All Days 13 19 24 28 30
Year 9 15 22 27 29
Quarter 6 11 18 23 26
Period 4 8 14 21 25
Week 2 5 10 17 20
Day 1 3 7 12 16
Store Zone District Region All Stores
Summary tables in a naive implementation require all combinations of the dimensions at each aggregation level...
30 summary tables! ... Add in item, SKU, subcategory, category, and all items...now we are up to 150 pre-aggregates!
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 44
ROLAP
Summary tables are more of a maintenance issue than a storage issue in most production implementations.
Notice that summary tables get much smaller as dimensions get less detailed (e.g., year vs. day).
Should plan for double the size of the unsummarized data for ROLAP summaries in most environments.
Every detail record that is received into warehouse must aggregate into EVERY summary table (assuming "to-date" summaries are maintained).
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 45
ROLAP
Warning: Do not assume that dimensions are always simple hierarchies.
Example: Items are not just category, subcategory, SKU, and atomic item.... what about trade styles or manufacturer?
Now we need summary tables along these lines as well...another 120 summary tables!
Calendar vs. accounting period vs. billing cycle can be even worse...
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 46
ROLAP
Many ROLAP products have devised ways to reduce the number of summary tables:
Ability to build summaries on-the-fly as demanded by end-user applications.
Ability to aggregate efficiently from subset of the summary tables.
Tools exist in some products to assist in DBAs in selecting the "best” aggregations to build.
HOLAP (Hybrid OLAP) tools allow co-existence of pre-built cubes alongside relational OLAP structures.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 47
Intelligent Aggregation Selection Maximum performance boost implies lots of disk
for every pre-calculation. Minimum performance boost implies no disk with
zero pre-calculation. Strategy is to use meta data to heuristically
determine optimum set of aggregates from which all other aggregates can be derived.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 49
Fact Table Aggregates
Enhance performance on common queries at coarser granularities.
Save space to permit storing more history than possible with finer granularities.
Take advantage of need to store other facts (with similar samples) at a particular granularity.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 50
Aggregate Advice
Coarser granularity decreases potential cardinality, but usually increases density (e.g., daily summary table is typically twice the size of weekly summary table - not seven times).
Strongly consider omitting candidate aggregates where expected cardinality is more than 10% that of next finer granularity stored.
Keep the detail for drill down, even if you deploy aggregates for performance.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 51
Bottom Line
There are many implementation techniques for delivery of an OLAP environment.
Must fully consider the performance, scalability, complexity, and flexibility characteristics when deciding between MOLAP, ROLAP, and HOLAP.
Understand your tools and RDBMS!
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 53
Implementation Issues
Data design and preparation Administration and performance OLAP platforms OLAP tools and products
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 54
Data Design and Preparation
Characteristics of data Stores and uses much less data compared to a DW Data is summarized. You will rarely find data at the lowest
levels of detail as in the DW Data is more flexible for processing and analysis partly
because there is much less data to work with Every instance of the OLAP system in your environment is
customized for the purpose that instance serves
OLAP data is generally customized Types and levels
Static and dynamic summary data Permanent and transient detailed data
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 55
Administration
Administering and managing OLAP systems should be handled with that of the DW environment
Some considerations Expectations on what data would be accessed and how Selection of the right business dimensions Selection of the right filters in loading data from the DW Choosing the aggregation, summarization, and precalculation Size of the multidimensional database Access and security privileges Backup and restore facilities Drill-through to the data warehouse; drill-through to another
OLAP instance
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 56
Performance
OLAP takes most of the queries that normalling would run against the DW
OLAP is designed for complex queries; so it should enhance overall query performance of the DW environment
OLAP can precalculate and pre-aggregate data for quick response
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 57
OLAP Platforms
Usually, the data warehouse and OLAP systems reside on the same platform in the start. Later when the data warehouse becmes large and OLAP is a common task, OLAP system is moved to another platform
A separate platform is needed when The size and usage of the DW consumes all resources Many departmental users desire OLAP capabilities The stability and performance of OLAP degrades OLAP tools require a different platform configuration than
the DW