
12th July 2017, BICOD'2017 @ London, United Kingdom

Taming Size and Cardinality of OLAP Data Cubes over Big Data

Alfredo Cuzzocrea University of Trieste & ICAR

Rim Moussa LaTICE Lab. & University of Carthage

Achref Labidi LaTICE Lab. & University of Carthage

The 31st British International Conference on Databases

@ London, United Kingdom 12th of July, 2017


Outline

Data Warehouse Systems
»DWS Architectures
»OLAP cube
»DSS Benchmarks

TPC-H*d: a Multi-dimensional Database Benchmark
»TPC-H*d
»AutoMDB

Application Scenarios of TPC-H*d
»Benchmarking Data Servers
»Benchmarking Multidimensional DB Schemas
»Benchmarking Parallel OLAP Servers

Conclusion & Research Agenda


Data Warehouses Architectures: Lazy Data Integration (Query-driven Architecture)

[Figure: query-driven architecture; components shown: relational data sources, wrappers, mediator]


Data Warehouses Architectures: Eager Data Integration (Warehouse System Architecture)

[Figure: warehouse system architecture; components shown: relational data sources, integration system and its integration workflows, data warehouse]

Data Warehousing --OLAP Cube

Facts: the objects that represent the subject of the desired analyses.
»Examples: sales records, weather records, cab trips, …
»The fact table contains 3 types of attributes: measured attributes, foreign keys to dimension tables, and degenerate dimensions

Dimension(s):
»Levels are individual values that make up dimensions
»Examples
    »Date dimension (Trimester, month, day)
    »Time dimension (hour, min, sec)
    »Geography dimension (Country, city, postal code)

Measure(s):
»Examples: revenue, lost revenue, sold quantities, expenses, …
»Use aggregate functions: min, max, count, distinct-count, sum, average, …
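A minimal Python sketch of the vocabulary on this slide, assuming a hypothetical in-memory fact table of sales records. The rows, the level names and the `roll_up` helper are illustrative and not taken from TPC-H. Each fact carries dimension values and one measure; a roll-up aggregates the measure over the chosen dimension levels.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, country, city, revenue).
# year/quarter are levels of a Date dimension, country/city are levels of a
# Geography dimension, and revenue is the measure.
FACTS = [
    (2016, "Q1", "UK", "London", 120.0),
    (2016, "Q1", "UK", "Leeds", 80.0),
    (2016, "Q2", "UK", "London", 50.0),
    (2017, "Q1", "IT", "Trieste", 70.0),
]
LEVELS = {"year": 0, "quarter": 1, "country": 2, "city": 3}


def roll_up(facts, levels, aggregate=sum):
    """Group the facts by the given levels and aggregate the revenue measure."""
    groups = defaultdict(list)
    for row in facts:
        key = tuple(row[LEVELS[level]] for level in levels)
        groups[key].append(row[4])
    return {key: aggregate(values) for key, values in groups.items()}


by_year_country = roll_up(FACTS, ["year", "country"])
max_by_country = roll_up(FACTS, ["country"], aggregate=max)
```

Passing another aggregate function (min, max, len for count) changes the measure computed without touching the grouping.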


OLAP Operations: Q10 of TPC-H benchmark

[Figure: OLAP cube for Q10; dimensions shown: customer (nation, details), order date (year, quarter), line return flag]


OLAP Operations On-Line Analytical Processing (Q10 of TPC-H benchmark)


Query Languages

Structured Query Language (SQL)
»Relational and static schema
»Data Definition, Data Manipulation, and Data Control Language
»Analytic Functions (window functions over partition by …)
»Cube, roll-up and grouping sets operators

MultiDimensional eXpressions (MDX)
»Invented by Microsoft in 1997
»For querying and manipulating the multidimensional data stored in OLAP cubes
»Static schema

Data Flow programming language
»Google Sawzall, Apache Pig Latin, IBM Infosphere Streams
»Dynamic schema
»After data is loaded, multiple operators are applied on that data before the final output is stored.

[Figure: Load Data → Apply Schema → Apply Filter → Group Data → Apply Aggregate Function → Sort Data → Store Output]
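The data flow steps above (load, apply schema, filter, group, aggregate, sort, store) can be sketched in plain Python. This is not Pig Latin; the pipe-delimited input, the field names and the `run_pipeline` helper are assumptions made for illustration. The point is that the schema is applied after loading, as in a dynamic-schema language.

```python
import csv
import io
from itertools import groupby

# Hypothetical raw input, pipe-delimited (loosely in the style of .tbl files).
RAW = "1|BUILDING|10.0\n2|MACHINERY|5.0\n3|BUILDING|7.5\n4|MACHINERY|-1.0\n"


def run_pipeline(raw):
    # Load data: read raw, untyped records.
    records = csv.reader(io.StringIO(raw), delimiter="|")
    # Apply schema: name and type the fields only after loading.
    typed = ({"key": int(k), "segment": s, "amount": float(a)} for k, s, a in records)
    # Apply filter.
    kept = [r for r in typed if r["amount"] > 0]
    # Group data (groupby needs its input sorted on the grouping key).
    kept.sort(key=lambda r: r["segment"])
    grouped = groupby(kept, key=lambda r: r["segment"])
    # Apply aggregate function.
    totals = [(segment, sum(r["amount"] for r in rows)) for segment, rows in grouped]
    # Sort data.
    totals.sort(key=lambda t: t[1], reverse=True)
    # Store output.
    out = io.StringIO()
    csv.writer(out, delimiter="|", lineterminator="\n").writerows(totals)
    return out.getvalue()


result = run_pipeline(RAW)
```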


Query Languages SQL – Q16 of TPC-H benchmark
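A sketch of Q16 following the query text of the TPC-H specification, with the same parameters as the MDX version on the next slide (Brand#45, MEDIUM POLISHED, eight part sizes); it should be checked against the official specification before being relied on. The three tables are reduced to the columns the query touches, and the rows are hypothetical toy data loaded into an in-memory SQLite database only so that the statement can be run.

```python
import sqlite3

# Q16 (parts/supplier relationship): count the suppliers that can supply parts
# with the requested attributes, excluding suppliers with complaints on record.
Q16 = """
SELECT p_brand, p_type, p_size, COUNT(DISTINCT ps_suppkey) AS supplier_cnt
FROM partsupp, part
WHERE p_partkey = ps_partkey
  AND p_brand <> 'Brand#45'
  AND p_type NOT LIKE 'MEDIUM POLISHED%'
  AND p_size IN (49, 14, 23, 45, 19, 3, 36, 9)
  AND ps_suppkey NOT IN (
        SELECT s_suppkey FROM supplier
        WHERE s_comment LIKE '%Customer%Complaints%')
GROUP BY p_brand, p_type, p_size
ORDER BY supplier_cnt DESC, p_brand, p_type, p_size
"""

conn = sqlite3.connect(":memory:")
# Hypothetical toy data: only part 1 passes the brand, type and size filters,
# and supplier 11 is excluded because of its comment.
conn.executescript("""
CREATE TABLE part (p_partkey INTEGER, p_brand TEXT, p_type TEXT, p_size INTEGER);
CREATE TABLE supplier (s_suppkey INTEGER, s_comment TEXT);
CREATE TABLE partsupp (ps_partkey INTEGER, ps_suppkey INTEGER);
INSERT INTO part VALUES
  (1, 'Brand#11', 'STANDARD BRUSHED TIN', 9),
  (2, 'Brand#45', 'STANDARD BRUSHED TIN', 9),
  (3, 'Brand#11', 'MEDIUM POLISHED STEEL', 9),
  (4, 'Brand#11', 'STANDARD BRUSHED TIN', 7);
INSERT INTO supplier VALUES
  (10, 'reliable'), (11, 'Customer filed Complaints'), (12, 'quick');
INSERT INTO partsupp VALUES
  (1, 10), (1, 11), (1, 12), (2, 10), (3, 10), (4, 10);
""")
rows = conn.execute(Q16).fetchall()
```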


Query Languages MDX –Q16 of TPC-H Benchmark

WITH
  SET [Brands] AS 'Except({[Part Brand].Members}, {[Part Brand].[Brand#45 ]})'
  SET [Types] AS 'Filter({[Part Type].Members}, (NOT ([Part Type].CurrentMember.Name MATCHES "(?i)MEDIUM POLISHED.*")))'
  SET [Sizes] AS 'Filter({[Part Size].Members}, ([Part Size].CurrentMember IN {[Part Size].[3], [Part Size].[9], [Part Size].[14], [Part Size].[19], [Part Size].[23], [Part Size].[36], [Part Size].[45], [Part Size].[49]}))'
SELECT [Measures].[Supplier Count] ON COLUMNS,
  nonemptyCrossjoin(nonemptyCrossjoin([Brands], [Types]), [Sizes]) ON ROWS
FROM [Cube16]


Query Languages Data Flow –Pig Latin script for Q16 of TPC-H benchmark


Decision Support Systems Benchmarks

Non-TPC Benchmarks
»Real datasets
    »Open data or proprietary data
    »Fixed size
    »Devise a workload or trace the proprietary workload
»APB-1: no scale factor

TPC Benchmarks
»The Transaction Processing Performance Council (TPC) was founded in 1988 to define benchmarks
»In 2009, TPC-TC was set up as an International Technology Conference Series on Performance Evaluation and Benchmarking
»Examples of benchmarks relevant for benchmarking decision support systems: TPC-H, TPC-DS and TPC-DI
»Common characteristics of TPC benchmarks
    »Synthetic data
    »Scale factor allowing generation of different volumes, from 1GB to 1PB
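To make the scale factor concrete, here is a small Python sketch for TPC-H. The base row counts at scale factor 1 are those given in the TPC-H specification (LINEITEM is approximate, NATION and REGION have a fixed size); the `expected_rows` helper is a hypothetical name.

```python
# Base row counts at scale factor 1, per the TPC-H specification.
# LINEITEM is approximate: its exact size depends on the generated orders.
ROWS_AT_SF1 = {
    "supplier": 10_000,
    "part": 200_000,
    "partsupp": 800_000,
    "customer": 150_000,
    "orders": 1_500_000,
    "lineitem": 6_000_000,
}
# These two tables do not grow with the scale factor.
FIXED_ROWS = {"nation": 25, "region": 5}


def expected_rows(scale_factor):
    """Expected table cardinalities for a given scale factor."""
    rows = {table: base * scale_factor for table, base in ROWS_AT_SF1.items()}
    rows.update(FIXED_ROWS)
    return rows


sf10 = expected_rows(10)
```

At SF=10, the value used in the experiments reported later in this deck, the fact table LINEITEM holds roughly 60 million rows.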


Decision Support Systems Benchmarks TPC-H Benchmark Schema (1/2)

TPC-H Benchmark
»22 ad-hoc SQL statements (star queries, nested queries, …) + refresh functions


Decision Support Systems Benchmarks TPC-H Benchmark (2/2)

TPC-H Benchmark: 2 Metrics
»QphH@Size is the number of queries processed per hour that the system under test (SUT) can handle for a fixed load
»$/QphH@Size is the ratio of cost to performance, where the cost is the cost of ownership of the SUT (hardware, software, maintenance)

Variants of TPC-H Benchmarks
»TPC-H*d Benchmark [Cuzzocrea and Moussa, 2013]
    »Turning TPC-H benchmark into a multi-dimensional benchmark
    »Few schema changes
    »Same TPC-H workload
    »2 MDX workloads: query workload and cube-then-query workload
»SSB: Star Schema Benchmark [O’Neil et al., 2012]
    »Turning TPC-H benchmark into a star schema
    »Workload composed of 13 queries
»TPC-H translated into Pig Latin (Apache Hadoop Ecosystem) [Moussa, 2012]
    »22 Pig Latin scripts which load and process TPC-H raw data files (.tbl files)
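The two TPC-H metrics on this slide can be illustrated with a short Python sketch. It assumes the composite definition of the TPC-H specification, where QphH@Size is the geometric mean of the power and throughput results; the function names and all figures are hypothetical.

```python
import math


def qphh(power_at_size, throughput_at_size):
    """Composite metric: geometric mean of the power and throughput results."""
    return math.sqrt(power_at_size * throughput_at_size)


def price_performance(total_cost, qphh_at_size):
    """$/QphH@Size: cost of ownership divided by the composite metric."""
    return total_cost / qphh_at_size


# Hypothetical figures, for illustration only.
composite = qphh(power_at_size=400.0, throughput_at_size=100.0)
ratio = price_performance(total_cost=50_000.0, qphh_at_size=composite)
```

A lower $/QphH@Size is better: the same query throughput is obtained for a lower cost of ownership.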


Decision Support Systems Benchmarks TPC-DS Benchmark (1/2)

TPC-DS Benchmark: 7 data marts


Decision Support Systems Benchmarks TPC-DS Benchmark (2/2)

TPC-DS Benchmark Workload
»About a hundred queries (99 query templates)
»OLAP, windowing functions, mining, and reporting queries
»ACID and concurrent data maintenance (not ACID in TPC-DS 2.x)

TPC-DS Benchmark Metrics: 2 main metrics
»QphDS@Size is the number of queries processed per hour that the system under test (SUT) can handle for a fixed load
    »Data maintenance and load times are calculated
»$/QphDS@Size is the ratio of cost to performance, where the cost is a 3-year cost of ownership of the SUT (hardware, software, maintenance)

TPC-DS implementations
»TPC-DS v2.0
    »Extension for non-relational systems such as Hadoop/Spark big data systems


Outline

Introduction

Part I: Data Warehouses

Part II: Multi-dimensional Database Design
»TPC-H*d
»AutoMDB

Part III: Application Scenarios

Conclusion


MDB Design Problem

Given,
»A relational warehouse schema
»A workload, i.e. a set of OLAP business queries, W = {Q1, Q2, …, Qn}, where Qi is a parameterized query

How to design the Multi-dimensional DB Schema?
»How to define cubes?
»Will there be a single cube or multiple cubes?
»Are there any rules for merging of cubes?

Which optimizations are suitable for performance tuning?
»Derived data calculus & refresh? (materialized views, derived attributes, indexes, …)
»Data partitioning & parallel cube building?


Idea

Map each business query to an OLAP cube

>> Obtain a multi-dimensional DB schema

Recommend & Test Optimizations

>> Derived Data

>> Data partitioning

>> Cube Merging


TPC-H*d Q8: From SQL statement to OLAP cube


TPC-H*d OLAP Cube C8

Market share of each supplier nation within a region of customers, for each year and each part type


TPC-H*d OLAP Query Q8

Market share of RUSSIAN suppliers within the AMERICA region, over the years 1995 and 1996, for part type ECONOMY ANODIZED STEEL
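The market share behind Q8 is, for each year, the volume sold by suppliers of the chosen nation divided by the total volume. A minimal Python sketch, assuming hypothetical rows that are already restricted to one customer region and one part type (in TPC-H the volume is the extended price net of discount); `market_share` is an illustrative helper, not part of the benchmark kit.

```python
from collections import defaultdict

# Hypothetical rows, already filtered on customer region and part type:
# (order year, supplier nation, volume).
ROWS = [
    (1995, "RUSSIA", 30.0),
    (1995, "BRAZIL", 70.0),
    (1996, "RUSSIA", 10.0),
    (1996, "CANADA", 30.0),
]


def market_share(rows, nation):
    """Per-year share of the volume supplied by the given nation."""
    total = defaultdict(float)
    national = defaultdict(float)
    for year, supplier_nation, volume in rows:
        total[year] += volume
        if supplier_nation == nation:
            national[year] += volume
    return {year: national[year] / total[year] for year in sorted(total)}


share = market_share(ROWS, "RUSSIA")
```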


AutoMDB

Open source software implemented in Java
»Parses MDB schema files (.xml) using the SAX library
»Performs comparisons of OLAP cubes' characteristics. For each pair of OLAP cubes,
    »shows whether or not they have the same fact table
    »computes the number of shared | different | coalescable dimensions
        »Dimensions are coalescable if they are extracted from the same dimension table and their hierarchies are coalescable
    »computes the number of shared | different measures
»Runs merge of OLAP cubes using different similarity functions
    »Simple distance function: whether or not the cubes have the same fact table
    »K-means clustering
        »Distance function is computed with weights assigned to cube characteristics
»Proposes Virtual Cubes
»Auto-generates a new MDB Schema (.xml)
»Creates MDB Schema from TPC-DS SQL Workload (on-going)
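The pairwise comparison and the simple same-fact-table grouping listed above can be sketched as follows. This is not AutoMDB's code (AutoMDB is written in Java and reads XML schemas): the `Cube` class, the helper names and the three cube descriptions are hypothetical, and coalescable dimensions are left out for brevity.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Cube:
    name: str
    fact_table: str
    dimensions: frozenset
    measures: frozenset


def compare(a, b):
    """Pairwise characteristics: fact table, shared/different dimensions, measures."""
    return {
        "same_fact_table": a.fact_table == b.fact_table,
        "shared_dimensions": len(a.dimensions & b.dimensions),
        "different_dimensions": len(a.dimensions ^ b.dimensions),
        "shared_measures": len(a.measures & b.measures),
    }


def group_by_fact_table(cubes):
    """Simple distance: cubes are merge candidates iff they share a fact table."""
    groups = {}
    for cube in cubes:
        groups.setdefault(cube.fact_table, []).append(cube.name)
    return groups


# Hypothetical cube descriptions (not the actual TPC-H*d definitions).
CUBES = [
    Cube("A", "lineitem", frozenset({"date", "nation"}), frozenset({"revenue"})),
    Cube("B", "lineitem", frozenset({"date", "part"}), frozenset({"revenue"})),
    Cube("C", "partsupp", frozenset({"part"}), frozenset({"supplier_count"})),
]
report = {(a.name, b.name): compare(a, b) for a, b in combinations(CUBES, 2)}
groups = group_by_fact_table(CUBES)
```

Each group with more than one cube is a candidate for a merged or virtual cube; a weighted distance over the same characteristics would feed a clustering step instead of the exact fact-table match.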


AutoMDB Load OLAP Cubes defined in xml file


AutoMDB Compare OLAP Cubes –have or not same fact table


AutoMDB Compare Cubes –Group cubes which have same fact table


AutoMDB Compare Cubes –Auto-generate a new MDB schema


Outline

Introduction

Part I: Data warehousing

Part II: Multidimensional DB Design

Part III: Application Scenarios
»Benchmarking Data Servers
»Benchmarking Multidimensional DB Schemas
»Benchmarking Parallel OLAP servers

Conclusion and Research agenda


Benchmarking Data Servers --Column-oriented storage systems vs row-oriented storage systems

Columnar Storage Systems
»High IO performance: less data moving from hard drives to memory
»Efficient Memory Management: load only required data into memory
»Reduced Storage: columns with low cardinality are compressed
»Efficient Schema Modifying Techniques: adding new columns will not induce a file storage re-organization

Types
»Binary Association Tables
    »Each column is stored in a separate (surrogate key, value) table
    »RDBMS: MonetDB
»Family of columns
    »Design techniques are based on measuring the affinity between attributes through the count of their co-occurrence in the query workload and clustering attributes
    »Vertical partitioning for DB design
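The affinity-based design of column families can be sketched as follows: count how often two attributes occur in the same query, then cluster attributes whose count reaches a threshold. The greedy threshold merge used here is a deliberately simple stand-in for a real clustering algorithm, and the workload and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical workload: the set of attributes each query reads.
WORKLOAD = [
    {"l_extendedprice", "l_discount", "l_shipdate"},
    {"l_extendedprice", "l_discount"},
    {"l_quantity", "l_shipdate"},
]


def affinity(workload):
    """Count, for each attribute pair, the queries in which both occur."""
    counts = Counter()
    for attributes in workload:
        for pair in combinations(sorted(attributes), 2):
            counts[pair] += 1
    return counts


def column_families(workload, threshold):
    """Greedy clustering: merge attributes whose affinity reaches the threshold."""
    families = {a: {a} for attrs in workload for a in attrs}
    for (a, b), count in affinity(workload).items():
        if count >= threshold and families[a] is not families[b]:
            merged = families[a] | families[b]
            for attribute in merged:
                families[attribute] = merged
    return {frozenset(f) for f in families.values()}


families = column_families(WORKLOAD, threshold=2)
```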


Benchmarking Data Servers --Column-storage systems vs row-based storage systems

Cube   MySQL                    MonetDB
C1     2,778 sec                30 sec
C10    Java heap space Error    758 sec
C11    2,558 sec                2,536 sec
C3     Mondrian Error: Size of cross join exceeded limit


Benchmarking Middleware for Parallel Cube Processing --OLAP & High Performance Computing

Systems which scale out through Data Fragmentation and Load Balancing achieve
»Parallel IO
»Parallel Processing

Technologies
»Parallel Cube processing OLAP servers
»Distributed Relational Data Warehouses + Mid-tier for parallel cube processing
»Hadoop Systems
»SQL-on-Hadoop Systems
    »e.g. Hive, Spark SQL, Drill, Impala, IBM BigInsights, …


Benchmarking Middleware for Parallel Cube Processing --OLAP* framework

Key Considerations for Data Fragmentation

Reduce the Size of Each Cube to be Built at Each Node
»Partitioning of big-cardinality dimensions

Simplify Post-Processing of OLAP Cubes
»Cubes which have disjoint dimensions’ members have simple post-processing (union all operation), while the merge of all dimensions' hierarchies is costly

Enhance Data Maintenance
»DW refresh processing
»Distributed Maintenance Transaction processing

Controlled Replication
»Replication has refresh and storage cost
»Replication optimizes join operations through dimension table replication
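A minimal Python sketch of the first two considerations, assuming facts are fragmented on one dimension so that each node holds disjoint members. The round-robin assignment of members to nodes, the helper names and the toy facts are assumptions and do not describe the OLAP* framework itself.

```python
from collections import defaultdict


def partition(facts, key_index, nodes):
    """Fragment facts on one dimension so that nodes hold disjoint members."""
    fragments = [[] for _ in range(nodes)]
    members = sorted({row[key_index] for row in facts})
    owner = {member: i % nodes for i, member in enumerate(members)}
    for row in facts:
        fragments[owner[row[key_index]]].append(row)
    return fragments


def build_cube(fragment, key_index, measure_index):
    """Local cube: sum of the measure per member of the partitioning dimension."""
    cube = defaultdict(float)
    for row in fragment:
        cube[row[key_index]] += row[measure_index]
    return dict(cube)


def union_all(cubes):
    """Post-processing: members are disjoint, so partial cubes simply concatenate."""
    merged = {}
    for cube in cubes:
        assert merged.keys().isdisjoint(cube)
        merged.update(cube)
    return merged


# Hypothetical facts: (customer nation, revenue).
FACTS = [("FRANCE", 10.0), ("ITALY", 5.0), ("FRANCE", 2.5), ("TUNISIA", 4.0)]
partial = [build_cube(f, 0, 1) for f in partition(FACTS, 0, nodes=2)]
result = union_all(partial)
```

Because no member appears on two nodes, no re-aggregation is needed after the local cubes are built, and each local cube is smaller than the global one.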


Benchmarking Middleware for Parallel Cube Processing --Performance Measurements with MySQL as DB backend

Cube   MySQL                    4 MySQL instances DB
C1     2,778 sec                862 sec
C10    Java heap space Error    13,774 sec


Benchmarking MDB Schemas

MDB Design

»Simple approach: map each query to the cube(s) it requires
»Sophisticated approach
    »Analyze the OLAP workload
    »Find out shared facts, dimensions and measures
    »Define new cubes based on cube clustering
    »Re-write the workload


Benchmarking MDB Schemas --TPC-H*d Example

»Same fact table
»2 shared dimension tables but different hierarchies
»1 different dimension
»Same measure


Benchmarking MDB Schemas --TPC-H*d Example

Cube     Initial schema   Virtual Cubes
C_5_7    -                3,457 sec
C5       3,200 sec        0.7 sec
C7       617 sec          0.2 sec


Conclusion and Future Work

Performance Leaks
»Mondrian cannot build an OLAP cube having more than 2,147,483,647 cells
»OLAP cube 20 has 200,052,100,026 cells

Experiments
»TPC-H with SF=10
»RDBMS: MonetDB and MySQL
»Tuning: materialized views and derived attributes
»Were run on Suno nodes (@Sophia, Grid5000 HPC platform)
    »Each node has 32GB of RAM
    »Mondrian requires more RAM

The XML description of the TPC-H and TPC-DS cubes allows us to sketch, recommend and assess
»vertical partitioning techniques for DB design (Family of columns)
»materialized views
»indexes
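The cell limit quoted above equals the largest signed 32-bit integer (2^31 - 1). A small Python sketch of the check, taking the product of the dimension cardinalities as an upper bound on the number of cells; the cardinalities and helper names are hypothetical, and only the two figures in the final comparison come from the slide.

```python
from math import prod

MAX_CELLS = 2**31 - 1  # 2,147,483,647, the limit quoted on the slide


def cell_count(cardinalities):
    """Upper bound on cube cells: product of the dimension cardinalities."""
    return prod(cardinalities)


def fits(cardinalities, limit=MAX_CELLS):
    """True if a cube with these dimension cardinalities stays within the limit."""
    return cell_count(cardinalities) <= limit


# Hypothetical dimension cardinalities, chosen only to illustrate the overflow.
small = [25, 7, 1_000]
large = [100_000, 2_000, 1_500]

# Cell count reported on the slide for OLAP cube 20.
cube20_exceeds_limit = 200_052_100_026 > MAX_CELLS
```

Partitioning a big-cardinality dimension across nodes divides one factor of this product, which is how fragmentation brings each local cube back under the limit.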


Future Work

Intelligent Recommenders for the selection of Indexes and Materialized Views
»Indexes and physical structures that can significantly accelerate performance
»The XML description of each cube allows us to recommend such structures

Recommenders for performance tuning
»AutoAdmin research project at Microsoft, which explores techniques to make databases self-tuning [Agrawal et al., 2000]
»Alerter Approach [Hose et al., 2008]: supports the aggregate configuration of an OLAP server by (1) continuously monitoring information about the workload and the benefit of aggregation tables and (2) alerting the DBA if changes to the current configuration would be beneficial
»Semi-Automatic Index Tuning: keeping DBAs in the loop [Schnaitter and Polyzotis, 2012]: online workload analysis with decisions delegated to the DBA. The solution takes into account index interactions


Research in Data Warehouse Modeling?

DOLAP Workshop 2006

IBM White paper 2015


References (1/3)

M. Fricke. The Knowledge Pyramid: A Critique of the DIKW Hierarchy. Journal of Information Science. 2009.
E.F. Codd, S.B. Codd and C.T. Salley. Providing OLAP to User Analysts: an IT mandate. 1993. http://www.minet.uni-jena.de/dbis/lehre/ss2005/sem_dwh/lit/Cod93.pdf
J. Widom. Integrating Heterogeneous databases: eager or lazy? ACM Computing Surveys (CSUR) Vol.4, 1996.
Y.R. Cho. Data Warehouse and OLAP Operations. http://www.ecs.baylor.edu/faculty/cho/4352
TPC homepage. http://www.tpc.org/
M. Poess, T. Rabl and B. Caufield. TPC-DI: The First Industry Benchmark for Data Integration. PVLDB 7(13): 1367-1378 (2014). http://www.vldb.org/pvldb/vol7/p1367-poess.pdf
X. Li, J. Han, H. Gonzalez. High-Dimensional OLAP: A Minimal Cubing Approach. VLDB 2004.
C. Imhoff, N. Galemmo, J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. 2003.
R. Kimball, M. Ross, W. Thornthwaite, J. Mundy, B. Becker. The Data Warehouse Lifecycle Toolkit. 2nd Edition.
R. Kimball, M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2nd Edition.
H. G. Molia. Data Warehousing Overview: Issues, Terminology, Products. http://www.cs.uh.edu/~ceick/6340/dw-olap.ppt (slides)


References (2/3) Modeling Multidimensional Databases (non exhaustive list)

M. Gyssens and L. V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases. VLDB’1997.

R. Agrawal, A. Gupta and S. Sarawagi. Modeling Multidimensional Databases. ICDE’1997.

J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. ICDE’1996.

P. Vassiliadis. Modeling Multidimensional Databases, Cubes and Cube Operations. SSDBM’1998.

L. Cabibbo and R. Torlone. A Logical Approach to Multidimensional Databases. EDBT’1998.

D. Cheung, B. Zhou, B. Kao, H. Lu, T. Lam and H. Ting. Requirement-based data cube schema design. CIKM’1999.

T. Niemi, J. Nummenmaa and P. Thanisch. Constructing OLAP cubes based on Queries. DOLAP’2001.

O. Teste. Towards Conceptual Multidimensional Design in Decision Support Systems. DEXA’2010.

A. Cuzzocrea and R. Moussa. Multidimensional Database Design via Schema Transformation: Turning TPC-H into the TPC-H*d Multidimensional Benchmark. COMAD’2013.


References (3/3)

M. Fowler. Schemaless data structures. 2013. http://martinfowler.com/articles/schemaless/
N. Marz and J. Warren. Big Data: Principles and best practices of scalable realtime data systems. 1st Edition.
S. Agrawal, S. Chaudhuri and V. Narasayya. Automated Selection of Materialized Views and Indexes for SQL Databases. VLDB’2000. http://www.research.microsoft.com/dmx/AutoAdmin
K. Hose, D. Klan, M. Marx and K. Sattler. When is it Time to Rethink the Aggregate Configuration of Your OLAP Server? VLDB’2008.
K. Schnaitter and N. Polyzotis. Semi-Automatic Index Tuning: Keeping DBAs in the Loop. VLDB’2012.
P. Zhao, X. Li, D. Xin and J. Han. Graph cube: on warehousing and OLAP multidimensional networks. SIGMOD’2011.
L. D. Lins, J. T. Klosowski and C. E. Scheidegger. Nanocubes for Real-Time Exploration of Spatiotemporal Datasets. IEEE Trans. Vis. Comput. Graph. 2013. https://github.com/laurolins/nanocube






Thank you for your Attention

Q & A

Taming Size and Cardinality of OLAP Data Cubes over Big Data

Alfredo Cuzzocrea, Rim Moussa and Achref Labidi

12th of July, 2017


Decision Support Systems Benchmarks TPC-DI Benchmark (1/3) [Poess et al. 2014]

For benchmarking Data Integration technologies

Synthetic data of a fictitious retail brokerage firm
»Internal trading system data, internal human resources data, internal CRM system data and external data
»Different data scales
»Data extracted from different sources:
    »Structured (csv)
    »Semi-structured data (xml)
    »Multi record (nested data)
    »Change Data Capture (CDC)

18 Complex Data Integration Tasks
»Load large volumes of historical data
»Load incremental updates
»Execute complex transformations
»Check and ensure consistency of data


TPC-H*d

Truly OLAP variant of TPC-H benchmark

TPC-H SQL workload translated into MDX (MultiDimensional eXpressions)

The workload is composed of 23 MDX statements for OLAP cubes and 23 MDX statements for OLAP business queries.

Each business question of TPC-H benchmark is mapped to an OLAP cube
