
OLAP Tuning

Outline

• OLAP 101
  – Data warehouse architecture
  – ROLAP, MOLAP and HOLAP

• Data Cube
  – Star Schema and operations
  – The CUBE operator
  – Tuning the cube

• Data Mining 101

© Dennis Shasha and Philippe Bonnet, 2013

OLAP

• Online Analytical Processing
  – OLAP enables a user to interactively and selectively extract and view data from different points of view.
  – Typical OLAP queries
    • Find sales for seniors in Copenhagen (selection)
    • Find sales per age group, per city (aggregation)
    • Find sales per age group, per country (aggregation)
    • Find total sales (aggregation)
    • Find sales for seniors, per country (selection, aggregation)

Selections & Aggregations on Multi-dimensional Data

Data Warehouse

[Figure: production data sources on the transactional-processing side; the data warehouse (DW) and its data marts on the analytical-processing side.]

• Infrequent updates
• Data in the DW need not be the latest, up-to-date version

ROLAP, MOLAP, HOLAP

• Multi-dimensional OLAP (MOLAP)
  – DW is a proprietary system, tailored for multi-dimensional data manipulations
• Relational OLAP (ROLAP)
  – Multi-dimensional data mapped onto tables, and manipulations mapped onto relational queries
• Hybrid OLAP (HOLAP)
  – Relational systems extended with specific OLAP functionalities

Star Schema

• Fact table
    Sales(Product_Id, Time_Id, City_id, Amount)
• Multi-dimensional data
• Can be represented as a (hyper-)cube
  – 3 dimensions: Product, Time, City
  – The cube contains Amount values
• Dimensions
    Product(Product_id, Name, Category, Price)
    Space(City_id, City, Country, Region)
    Time(Time_id, Week, Month, Quarter)
• Typically organized in a hierarchy
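As a minimal sketch, the star schema on this slide can be tried out with Python's built-in sqlite3 module. The table and column names follow the slide; the sample rows and the choice of SQLite are assumptions made only for illustration.

```python
import sqlite3

# In-memory database. Table and column names follow the slide;
# every sample row below is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product(Product_id INTEGER PRIMARY KEY, Name TEXT, Category TEXT, Price REAL);
CREATE TABLE Space(City_id INTEGER PRIMARY KEY, City TEXT, Country TEXT, Region TEXT);
CREATE TABLE Time(Time_id INTEGER PRIMARY KEY, Week INTEGER, Month INTEGER, Quarter INTEGER);
CREATE TABLE Sales(
    Product_id INTEGER REFERENCES Product(Product_id),
    Time_id    INTEGER REFERENCES Time(Time_id),
    City_id    INTEGER REFERENCES Space(City_id),
    Amount     REAL
);
INSERT INTO Product VALUES (1, 'Pen', 'Office', 2.0), (2, 'Milk', 'Food', 1.0);
INSERT INTO Space VALUES (1, 'Copenhagen', 'Denmark', 'Europe'),
                         (2, 'Aarhus', 'Denmark', 'Europe'),
                         (3, 'Paris', 'France', 'Europe');
INSERT INTO Time VALUES (1, 1, 1, 1), (2, 14, 4, 2);
INSERT INTO Sales VALUES (1, 1, 1, 10.0), (2, 1, 2, 5.0), (1, 2, 3, 7.0), (2, 2, 1, 3.0);
""")

# Sales amount per country, per category: the fact table is joined to the
# dimensions it needs, then grouped by dimension attributes.
rows = conn.execute("""
    SELECT s.Country, p.Category, SUM(f.Amount)
    FROM Sales f
    JOIN Space s   ON f.City_id = s.City_id
    JOIN Product p ON f.Product_id = p.Product_id
    GROUP BY s.Country, p.Category
    ORDER BY s.Country, p.Category
""").fetchall()
```

The fact table carries only foreign keys and the measure; every descriptive attribute used for grouping comes from a dimension table.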

Drill Down and Roll Up

• Dimensions as aggregation hierarchy
• Drill down
  – Series of queries that moves down the aggregation hierarchy
  – E.g., per region, per country, per city
• Roll up
  – Series of queries that moves up the aggregation hierarchy
  – E.g., per week, per month, per year
• Same form of SQL query, different attributes
  – When rolling up, query results can be re-used
    • An aggregation can be used as a basis for an aggregation one or more levels up in the hierarchy
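The re-use of results when rolling up can be sketched in plain Python. The fact rows are hypothetical, and the sketch assumes that each week belongs to exactly one month and that the aggregate is a SUM, which can be recomputed from partial sums.

```python
from collections import defaultdict

# Hypothetical fact rows: (week, month, amount). Each week belongs to one month.
facts = [(1, 1, 10), (1, 1, 5), (2, 1, 7), (5, 2, 3), (6, 2, 4)]

def aggregate(rows, key):
    """Sum the last field of each row, grouped by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[-1]
    return dict(totals)

# Finer aggregate: sales per (week, month).
per_week = aggregate(facts, key=lambda r: (r[0], r[1]))

# Roll up by re-aggregating the per-week result instead of rescanning the facts.
per_month_reused = aggregate(
    [(week, month, total) for (week, month), total in per_week.items()],
    key=lambda r: r[1],
)

# Same answer as computing per-month sales directly from the fact rows.
per_month_direct = aggregate(facts, key=lambda r: r[1])
```

The rolled-up query reads four aggregated rows rather than five fact rows here; on a real fact table the gap is far larger.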

Pivoting

• Data as a cube which is pivoted so that a user can “see” its various faces
  – Pivoting on dimensions D1, D2, D3 means grouping by attributes from these dimensions
  – New pivot on D3, D2, D1
• Interesting when visualization software is used to represent 3 dimensions as x, y, z in space
• Interesting if there are N dimensions, and the pivot concerns a subset of these dimensions

Slicing and Dicing

• Slice
  – A value is given for a dimension attribute in the WHERE clause
  – We take a “slice” of the cube
• Dice
  – Multiple values (or a range) are given for a dimension attribute in the WHERE clause
  – We are dicing, i.e., reducing the size of the original cube
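A small sketch of a slice and a dice, run through sqlite3. For brevity it assumes a hypothetical pre-joined table SalesView(Category, City, Year, Amount) rather than the full star schema; the table name, the Year column and the rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-joined view of the star schema, with an assumed Year column.
conn.execute("CREATE TABLE SalesView(Category TEXT, City TEXT, Year INTEGER, Amount REAL)")
conn.executemany(
    "INSERT INTO SalesView VALUES (?, ?, ?, ?)",
    [
        ("Food", "Copenhagen", 2009, 4.0),
        ("Food", "Copenhagen", 2011, 6.0),
        ("Food", "Copenhagen", 2012, 5.0),
        ("Office", "Paris", 2010, 1.0),
        ("Office", "Paris", 2012, 9.0),
    ],
)

# Slice: a single value for the Year attribute in the WHERE clause.
slice_rows = conn.execute("""
    SELECT Category, City, SUM(Amount) FROM SalesView
    WHERE Year = 2012
    GROUP BY Category, City ORDER BY Category, City
""").fetchall()

# Dice: a range of values for the same attribute.
dice_rows = conn.execute("""
    SELECT Category, City, SUM(Amount) FROM SalesView
    WHERE Year BETWEEN 2010 AND 2012
    GROUP BY Category, City ORDER BY Category, City
""").fetchall()
```

Both queries keep the same GROUP BY; only the predicate on the dimension attribute changes.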

Star Schema Operations

• Write the following sequence of queries in SQL on the sales star schema
  – Original cube:
    • Sales amount per country, per week, per category
  – Roll-up on time
    • Sales amount per country, per month, per category
    • Sales amount per country, per year, per category
  – Drill-down on city
    • Sales amount per city, per year, per category
  – Pivot on product, time and space
    • Sales amount per category, per year, per city
  – Slice on year 2012
    • Sales amount per category, per city for 2012
  – Dice on the last three years
    • Sales amount per category, per city for 2010, 2011, 2012

Star Schema Operations

• What are the SQL queries you need to construct the following table?

            Product 1   Product 2   Product 3   Total
  City 1          520         230         100     850
  City 2           10          15          10      35
  City 3         1000        1200        1000    3200
  Total          1530        1445        1110    4085
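One possible answer, sketched with sqlite3: the table needs four aggregations, one per region of the cross-tab (inner cells, row totals, column totals, grand total). The Sales table here is a simplified stand-in with one row per cell, which is an assumption; a real fact table would hold many rows per city and product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the fact table: one row per (city, product) cell.
conn.execute("CREATE TABLE Sales(City_id TEXT, Product_id TEXT, Amount INTEGER)")
cells = {
    "City 1": (520, 230, 100),
    "City 2": (10, 15, 10),
    "City 3": (1000, 1200, 1000),
}
for city, amounts in cells.items():
    for i, amount in enumerate(amounts, start=1):
        conn.execute("INSERT INTO Sales VALUES (?, ?, ?)", (city, f"Product {i}", amount))

# Inner cells: one group per (city, product).
cell_rows = conn.execute(
    "SELECT City_id, Product_id, SUM(Amount) FROM Sales GROUP BY City_id, Product_id"
).fetchall()
# Right-hand Total column: one group per city.
city_totals = dict(conn.execute(
    "SELECT City_id, SUM(Amount) FROM Sales GROUP BY City_id"
).fetchall())
# Bottom Total row: one group per product.
product_totals = dict(conn.execute(
    "SELECT Product_id, SUM(Amount) FROM Sales GROUP BY Product_id"
).fetchall())
# Bottom-right corner: no grouping at all.
grand_total = conn.execute("SELECT SUM(Amount) FROM Sales").fetchone()[0]
```

Four separate scans of the fact table are needed with plain GROUP BY, one per grouping.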

The CUBE Operator

SELECT city_id, product_id, SUM(amount) AS sum_a
FROM SALES
GROUP BY CUBE (city_id, product_id)

• Defined by Jim Gray et al. in 1996
• Part of the SQL standard
• Supported in Oracle, DB2, SQL Server
• ROLAP implementation

  City_id   Product_id   Sum_a
  City1     Product1       520
  City1     Product2       230
  City1     Product3       100
  City1     ALL            850
  City2     Product1        10
  City2     Product2        15
  City2     Product3        10
  City2     ALL             35
  City3     Product1      1000
  City3     Product2      1200
  City3     Product3      1000
  City3     ALL           3200
  ALL       Product1      1530
  ALL       Product2      1445
  ALL       Product3      1110
  ALL       ALL           4085
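To make the semantics of CUBE concrete, here is a plain-Python emulation, not how any database implements it. It groups by every subset of the listed attributes and marks aggregated-away attributes with the string 'ALL' as in the table above; SQL implementations return NULL in those positions. The function name and row layout are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, dims, measure):
    """Emulate GROUP BY CUBE(dims) with SUM(measure) over a list of dict rows.

    Every subset of dims is used as a grouping; attributes that are
    aggregated away are reported as the string 'ALL'.
    """
    result = {}
    for k in range(len(dims), -1, -1):
        for kept in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] if d in kept else "ALL" for d in dims)
                totals[key] += row[measure]
            result.update(totals)
    return result

# The nine cells of the example table.
sales = [
    {"city_id": c, "product_id": p, "amount": a}
    for c, p, a in [
        ("City1", "Product1", 520), ("City1", "Product2", 230), ("City1", "Product3", 100),
        ("City2", "Product1", 10), ("City2", "Product2", 15), ("City2", "Product3", 10),
        ("City3", "Product1", 1000), ("City3", "Product2", 1200), ("City3", "Product3", 1000),
    ]
]
sum_a = cube(sales, ["city_id", "product_id"], "amount")
```

With n grouping attributes the cube contains 2^n groupings, which is why the two-attribute example yields 9 + 3 + 3 + 1 = 16 rows.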

http://paul.rutgers.edu/~aminabdu/cs541/cube_op.pdf

http://www.dba-oracle.com/t_cube.htm

http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=/com.ibm.db2.luw.sql.ref.doc/doc/r0000761.html&resultof=%22group%22%20%22cube%22

http://msdn.microsoft.com/en-us/library/bb522495(v=sql.105).aspx

http://www.madgik.di.uoa.gr/sites/default/files/acm_csur_v39.4.pp12.1-12.53.pdf

The ROLLUP Operator

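SELECT city_id, product_id, SUM(amount) AS sum_a
FROM SALES
GROUP BY city_id, product_id WITH ROLLUP
-- WITH ROLLUP is the SQL Server / MySQL form; the standard form,
-- also used by Oracle and DB2, is GROUP BY ROLLUP (city_id, product_id)

• Part of the SQL standard
• Supported in Oracle, DB2, SQL Server, MySQL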

  City_id   Product_id   Sum_a
  City1     Product1       520
  City1     Product2       230
  City1     Product3       100
  City1     ALL            850
  City2     Product1        10
  City2     Product2        15
  City2     Product3        10
  City2     ALL             35
  City3     Product1      1000
  City3     Product2      1200
  City3     Product3      1000
  City3     ALL           3200
  ALL       ALL           4085
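ROLLUP differs from CUBE only in which groupings it keeps: the prefixes of the attribute list rather than all subsets. The following plain-Python sketch illustrates that; the function name and row layout are assumptions, and 'ALL' stands in for the NULL that SQL implementations return.

```python
from collections import defaultdict

def rollup(rows, dims, measure):
    """Emulate GROUP BY ROLLUP(dims) with SUM(measure) over a list of dict rows.

    Only prefixes of dims are used as groupings: (d1, d2), (d1), ().
    """
    result = {}
    for k in range(len(dims), -1, -1):
        kept = dims[:k]
        totals = defaultdict(int)
        for row in rows:
            key = tuple(row[d] if d in kept else "ALL" for d in dims)
            totals[key] += row[measure]
        result.update(totals)
    return result

# The nine cells of the example table.
sales = [
    {"city_id": c, "product_id": p, "amount": a}
    for c, p, a in [
        ("City1", "Product1", 520), ("City1", "Product2", 230), ("City1", "Product3", 100),
        ("City2", "Product1", 10), ("City2", "Product2", 15), ("City2", "Product3", 10),
        ("City3", "Product1", 1000), ("City3", "Product2", 1200), ("City3", "Product3", 1000),
    ]
]
sum_a = rollup(sales, ["city_id", "product_id"], "amount")
```

With n attributes there are n + 1 groupings, so the per-product totals across all cities are absent from the result and the example yields 9 + 3 + 1 = 13 rows.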

http://www.oracle-base.com/articles/misc/rollup-cube-grouping-functions-and-grouping-sets.php

http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=/com.ibm.db2.luw.sql.ref.doc/doc/r0059215.html

http://msdn.microsoft.com/en-us/library/ms189305(v=sql.90).aspx

http://dev.mysql.com/doc/refman/5.0/en/group-by-modifiers.html

Tuning the Cube

• Materialized Views
  – To materialize the original cube and the result of important cube manipulations (those that are re-used often)
• Indexes
  – Speeding up foreign-key/primary-key joins
    • Dimensions as index-organized tables (clustering index on primary key)
    • Non-clustered index on foreign key in fact table
  – Indexing low-cardinality attributes
    • Bitmap index (Oracle)
  – SQL Server columnstore indexes
• Compression
  – Speeding up scans, reducing DW footprint on secondary storage
• Column-Oriented Representation
  – Great for slicing, dicing
  – Great for compression
  – Great for leveraging RAM, processor cache
• Parallelism
  – Work on dimensions in parallel, speeding up scans
  – Tuning degree of parallelism in Oracle, DB2
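To illustrate why bitmap indexes suit low-cardinality attributes, here is a toy version that uses Python integers as bit vectors. It is a sketch of the idea only, not Oracle's implementation (which stores compressed bitmaps); the column contents and function names are invented.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = {}
    for position, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << position)
    return index

def matching_rows(bitmap):
    """Return the row positions whose bit is set in bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical low-cardinality columns of a six-row fact table.
age_group = ["senior", "adult", "senior", "child", "adult", "senior"]
city = ["Copenhagen", "Paris", "Paris", "Copenhagen", "Copenhagen", "Copenhagen"]

age_index = build_bitmap_index(age_group)
city_index = build_bitmap_index(city)

# "Sales for seniors in Copenhagen": AND the two bitmaps, then fetch the rows.
seniors_in_copenhagen = matching_rows(age_index["senior"] & city_index["Copenhagen"])
```

A conjunctive selection becomes a bitwise AND over one bitmap per predicate, and only the surviving row positions are read from the fact table.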

http://www.dba-oracle.com/oracle_tips_bitmapped_indexes.htm

http://msdn.microsoft.com/en-us/library/gg492088.aspx



http://docs.oracle.com/cd/B28359_01/server.111/b28313/usingpe.htm

http://www.scs.carleton.ca/research/tech_reports/2007/download/TR-07-02.pdf

Data Warehouse Appliances

• Exadata Data Sheet
• Normal Table Scan vs. Exadata Smart Scan: see B. Durrett’s slides

http://www.oracle.com/technetwork/server-storage/engineered-systems/exadata/exadata-dbmachine-x3-8-ds-1855388.pdf?ssSourceSiteId=ocomen


http://www.bobbydurrettdba.com/2013/03/28/yet-another-exadata-slides-update/



Column Stores

• Columnar representation
  – Compression & scan efficiency
  – Tailored for RAM, processor cache utilization
• VectorWise
• SQL Server ColumnStore indexes
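A rough sketch of why a columnar layout helps compression and scans: each attribute is stored as its own list, a sorted low-cardinality column collapses under run-length encoding, and an aggregate touches only the column it needs. The data and function names are hypothetical, and run-length encoding is just one of several schemes a column store might use.

```python
from itertools import groupby

def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def count_equal(encoded, target):
    """Count rows equal to target by scanning runs instead of individual rows."""
    return sum(length for value, length in encoded if value == target)

# Hypothetical column-wise layout of a small fact table, sorted by country.
country = ["DK", "DK", "DK", "DK", "FR", "FR", "SE"]
amount = [10, 5, 3, 7, 2, 8, 1]

encoded_country = rle_encode(country)           # 7 values stored as 3 runs
rows_in_dk = count_equal(encoded_country, "DK")
total_amount = sum(amount)                      # reads only the amount column
```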

http://fastreporting.files.wordpress.com/2011/03/vectorwise-whitepaper.pdf

Data Mining 101

• Boundaries of Data Management, Statistics and Machine Learning
• Finding Patterns in Large Data Sets
  – Associations
    • Many buy Product1 and Product3 together
  – Classification
    • Given some predefined classes (e.g., StaysInBusiness, GoesOutOfBusiness), train a classifier to distinguish in which class a store belongs based on its sales records
    • Might be used for prediction
  – Clustering
    • Like classification, but the classes are not given beforehand. They are discovered by the clustering algorithm.

Associations

• An association is a correlation between values in the same or different columns
  – Noted Predicate1 => Predicate2
  – Example: Purchases_Diaper => Purchases_Beer
• Confidence and Support
  – Confidence (rule): percentage of the records where Predicate1 is true in which Predicate2 is also true
  – Support (itemset): percentage of records where all attribute values needed by the rule are present
• Confidence and support must be over a given threshold so that an association holds
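Support and confidence can be computed directly from their definitions. The sketch below uses five invented market baskets and hypothetical thresholds; it expresses both measures as fractions rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of the transactions containing antecedent that also contain consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Invented baskets, one set of purchased items per record.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"diaper", "beer"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})  # 2 of the 3 diaper baskets

# Hypothetical thresholds: the association holds only if both are exceeded.
MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.6
holds = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
```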

Classification

• Decision tree, Neural networks
  – Multiple variables analysis
  – Learning algorithm (training set)
• Example: Titanic survivors

  Sex == M?
    F: Survived (36%)
    T: Age > 9.5?
      T: Died (61%)
      F: Number of Siblings > 2.5?
        T: Died (2%)
        F: Survived (2%)
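A trained decision tree is applied as a chain of nested tests. The function below is one reading of the Titanic tree on this slide: the three tests and the four leaf labels come from the slide, while the branch directions are an assumption based on the commonly shown version of this tree. The function and parameter names are hypothetical.

```python
def predict_survival(sex, age, siblings):
    """Classify a passenger with the three tests named on the slide.

    Branch directions are assumed, not read off the original figure.
    """
    if sex != "M":
        return "Survived"   # leaf holding 36% of passengers
    if age > 9.5:
        return "Died"       # leaf holding 61% of passengers
    if siblings > 2.5:
        return "Died"       # leaf holding 2% of passengers
    return "Survived"       # leaf holding 2% of passengers
```

For example, `predict_survival("M", 5, 1)` follows the male, young, few-siblings path and returns "Survived".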

https://www.kaggle.com/c/titanic-gettingStarted/data

Clustering

• K-means algorithm
  1. Each (of the given K) cluster is given a centroid
  2. Form clusters by assigning points to the cluster with the closest centroid (distance is defined)
  3. Recompute cluster centroids
  4. Repeat 2, 3 until centroids do not move
• Other techniques
  – Hierarchical clustering (e.g., BIRCH), Support Vectors
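The four K-means steps can be sketched in a few lines of plain Python. This version assumes Euclidean distance, takes the initial centroids as an argument, and leaves an empty cluster's centroid where it is; those choices, the iteration cap and the sample points are assumptions made for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain K-means with Euclidean distance.

    centroids is the initial list of K centroids (step 1).
    Returns (final centroids, cluster index assigned to each point).
    """
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster: keep its centroid in place
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment

# Two obvious groups of 2-D points, with one initial centroid placed in each.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
final_centroids, labels = kmeans(points, [(0, 0), (10, 10)])
```

Each iteration is a full pass over the points, which is why scan speed dominates the cost on large data sets.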

Tuning for Mining

• Tuning Scans
  – Most algorithms require several passes over the data
  – Parallelism & compression
• Statistics features in systems
  – Predictive Analytics in SQL Server
  – Data mining features of DB2
  – Oracle DataMiner

http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/predictive-analytics.aspx

http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=/com.ibm.datatools.datamining.doc/c_dp_Features.html&resultof=%22data%22%20%22mining%22%20%22mine%22%20%22features%22%20%22featur%22

http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html
