Designing High Performance Datawarehouse


DESCRIPTION

Just when the world of “Data 1.0” showed some signs of maturing, “outside-in” demands have already initiated some of the disruptive changes to the data landscape. Parallel growth in the volume, velocity and variety of data, coupled with an incessant push to find newer insights and value from data, has posed a big question: is your data warehouse still relevant? In short, the surrounding changes happening in real time are the new “Data 2.0”. It is characterized by feeding ever-hungry minds with sharper insights, whether related to regulation, finance, corporate actions or risk management, or purely aimed at improving operational efficiencies. The source in this new “Data 2.0” has to be commensurate with the outside-in demands from customers, regulators, stakeholders and business users; hence you need a high “relformance” (relevance + performance) data warehouse that is relevant to your business ecosystem and has the power to scale exponentially. We start this webinar by giving the audience a sneak preview of what happened in the Data 1.0 world and which characteristics are shaping the new Data 2.0 world. It then delves into the challenges that growing data volumes have posed to data warehouse teams, and presents some practical and proven methodologies to address these performance challenges. Finally, it highlights some thought-provoking ways to turbocharge your data warehouse initiatives by leveraging newer technologies such as Hadoop. Overall, the webinar will educate audiences on building high performance, relevant data warehouses capable of meeting the newer demands while significantly driving down the total cost of ownership.

TRANSCRIPT

Page 1: Designing high performance datawarehouse

Designing High Performance Datawarehouse

Welcome to the webinar on

Presented by

&

Page 2: Designing high performance datawarehouse

Contents

1 What happened in the Data 1.0 World

2 What is shaping the new Data 2.0 World

3 Designing High Performance Datawarehouse

4 Q & A

Page 3: Designing high performance datawarehouse

What happened in the Data 1.0 World?

Before 2000 2000s Now

Do we need a DWH?

Advent of ODS

Data Silos

Metrics for success?

OLAP = Insights

Painful Implementations

Select success : top down & bottom up

We’ve got BI / DWH Tools

Performance vs. Volume : Game Changer

Drill-down Reporting from DWH – getting into mainstream

Standardized KPIs

Analytics as differentiator?

Retaining skills and expertise

Business led

Volume | Variety | Velocity | Value

Need insights from non-structured data as well

Analytics is a differentiator

Show me the ROI

(DATA) Big, real time, in-memory – what to do with existing initiatives?

Data 2.0 : scale, performance, knowledge, relevance

Page 4: Designing high performance datawarehouse

Challenges in the current DW environment – survey (TDWI research based on 278 respondents; top responses):

• Can’t scale to big data volumes – 42%
• Inadequate data load speed – 27%
• Poor query response – 27%
• Existing DW modeled for reports & OLAP only – 25%
• Can’t score analytic models fast enough – 24%
• Cost of scaling up or out is too expensive – 24%
• Can’t support high concurrent user count – 23%
• Inadequate support for in-memory processing – 19%
• Current platform needs great manual effort for performance – 18%
• Poorly suited to real-time workloads – 18%
• Can’t support in-database analytics – 15%
• Poor CPU speed and capacity – 15%
• Current platform is a legacy, we must phase it out – 9%

Page 5: Designing high performance datawarehouse

High Performance Data Warehouse

Concurrency Enabled

Able to handle Complexity

Ability to Scale

Speed

Social Media Data

Text Data

Sensor Data

Syndicated Data

Numeric Data

True Sentiment

Faster Compliance

Faster Reach

Big Data Analytics

Analytics = Competitive Advantage

Efficiencies driving down costs

Customer experience & service

Every 18 months, non-rich structured and unstructured enterprise data doubles.

Data 2.0 World

Business is now equipped to consume, identify and act upon this data for superior insights

Page 6: Designing high performance datawarehouse

So what is a High Performance Datawarehouse?

Key Dimensions

Page 7: Designing high performance datawarehouse

SPEED

COMPLEXITY

CONCURRENCY

SCALE

HIGH PERFORMANCE DATA WAREHOUSE

Page 8: Designing high performance datawarehouse

The four key dimensions of a High Performance Data Warehouse:

SPEED – Streaming big data, event processing, real-time operation; operational BI, near-time analytics, dashboard refresh; fast queries

COMPLEXITY – Big data variety (unstructured, sensor, social media); many sources / targets, complex models and SQL, high availability

CONCURRENCY – Competing workloads (OLAP, analytics), intraday data loads, thousands of users, ad hoc queries

SCALE – Big data volumes, detailed source data, thousands of reports; scale out into cloud, clusters, grids, etc.

Page 9: Designing high performance datawarehouse

Designing High Performance Datawarehouse

Page 10: Designing high performance datawarehouse

TDWI research based on 329 responses from 114 respondents – industry recognized top techniques:

• Creating summary tables – 45%
• Adding indexes – 44%
• Altering SQL statements or routines – 33%
• Changing physical data models – 24%
• Using in-memory databases – 24%
• Upgrading hardware – 21%
• Moving an application to a separate data mart – 20%
• Applying workload management controls – 16%
• Choosing between column- and row-oriented data storage – 16%
• Restricting or throttling user queries – 16%
• Shifting some workloads to off-peak hours – 15%
• Adjusting system parameters – 10%
• Others – 6%

Page 11: Designing high performance datawarehouse

Designing Summary Tables

45% say: Creating Summary Tables

Page 12: Designing high performance datawarehouse

Summary table design process

COLLECT – A good sampling of queries. These may come from user interviews, testing / QA queries, production queries, reports or any other means that provides a good representation of expected production queries.

ANALYZE – The dimension hierarchy levels, dimension attributes, and fact table measures that are required by each query or report.

IDENTIFY – The row counts associated with each dimension level represented.

BALANCE – The most commonly required dimension levels against the number of rows in the resulting summary tables. A goal should be to design summary tables that are roughly 1/100th the size of the source fact tables in terms of rows (or less).

MINIMIZE – The columns that are carried in the summary table in favor of joining back to the dimension table. The larger the summary table, the less performance advantage it provides.

Some of the best candidates for aggregation will be those where the row counts decrease the most from one level in a hierarchy to the next.
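To make the BALANCE and MINIMIZE steps concrete, here is a minimal SQL sketch of a summary table at a district / calendar-month grain. All table and column names (sales_fact, store_dim, date_dim, etc.) are hypothetical, and the CREATE TABLE ... AS syntax varies by platform (SQL Server, for instance, uses SELECT ... INTO).

-- Hypothetical names; the district / month grain is chosen so the summary
-- lands well under the rough 1/100th-of-the-fact-table sizing guideline.
CREATE TABLE sales_summary_district_month AS
SELECT
    s.district_key,                 -- minimized: keep only the keys needed
    d.calendar_month_key,           -- to join back to the dimension tables
    SUM(f.sale_amt)  AS sale_amt,   -- pre-calculated additive measures
    SUM(f.sales_qty) AS sales_qty,
    COUNT(*)         AS txn_cnt     -- a "count" column preserves some
FROM sales_fact f                   -- information for non-additive facts
JOIN store_dim  s ON f.store_key = s.store_key
JOIN date_dim   d ON f.date_key  = d.date_key
GROUP BY s.district_key, d.calendar_month_key;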

Page 13: Designing high performance datawarehouse

Capturing requirements for Summary table

• Choosing aggregates to create – there are two basic pieces of information which are required to select the appropriate aggregates:
  • Expected usage patterns of the data
  • Data volumes and distributions in the fact table

Report requirements (dimension level used per dimension, plus measures):

Report    | Store    | Item     | Date                          | Measures
Report 1  | District |          | Calendar Year                 | Sale_Amt, Sales_Qty
Report 2  | District | Category | Calendar Year, Calendar Month | Sale_Amt, Sales_Qty
Report 3  | District |          | Calendar Year, Calendar Month | Sale_Amt, Sales_Qty
Report 4  | District |          | Fiscal Period                 | Sale_Amt
Report 5  | Store    | Dept     | Fiscal Week                   | Sales_Qty
Report 6  |          | Dept     | Fiscal Period                 | Sale_Amt
Report 7  | District |          | Fiscal Week                   | Sale_Amt, Sales_Qty
Report 8  | District |          | Fiscal Week                   | Sale_Amt, Sales_Qty
Report 9  | District | Dept     | Fiscal Quarter                | Sale_Amt
Report 10 | District |          | Fiscal Period                 | Sales_Qty
Report 11 | Region   | Category | Fiscal Week                   |

Number of populated members per dimension level:

Store (Geography): Division 1, Region 3, District 50, Store 3,980
Item (Category): Subject 279, Category 1,987, Department 4,145
Date: Fiscal Year 3, Fiscal Quarter 12, Fiscal Period 36, Fiscal Week 156
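The member counts above already bound the candidate aggregates: a District × Fiscal Week grain can never exceed 50 × 156 = 7,800 rows, while District × Category × Fiscal Period could reach 50 × 1,987 × 36 (roughly 3.6 million) rows. A quick sketch for checking how many combinations actually occur in the data follows; sales_fact, store_dim and date_dim are hypothetical names.

-- Count the distinct key combinations at a candidate grain and compare the
-- result with COUNT(*) of the base fact table to test the 1/100th guideline.
SELECT COUNT(*) AS candidate_summary_rows
FROM (
    SELECT DISTINCT s.district_key, d.fiscal_week_key
    FROM sales_fact f
    JOIN store_dim s ON f.store_key = s.store_key
    JOIN date_dim  d ON f.date_key  = d.date_key
) g;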

Page 14: Designing high performance datawarehouse

Summary table design considerations

Aggregate storage column selection:
• Semi-additive and all non-additive fact data need not be stored in the summary table
• Add as many “pre-calculated” columns as possible
• “Count” columns could be added for non-additive facts to preserve a portion of the information

Storing aggregate rows:
• A combined table containing base-level fact rows and aggregate rows
• A single aggregate table which holds all aggregate data for a single base fact table
• A separate table for each aggregate created – the most preferred option

Recreating vs. updating aggregates:
• It is efficient for aggregation programs to update the aggregate tables with the newly loaded data
• Regeneration is more appropriate if there is a lot of program logic to determine what data must be updated in the aggregate table

Storing aggregate dimension data (multiple hierarchies in a single dimension):
• Store all of the aggregate dimension records together in a single table
• Use a separate table for each level in the dimension
• Add dimension data to the aggregate fact table
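As an illustration of the “count column” point above: an average is non-additive and cannot be rolled up from stored averages, but storing the additive sum together with a count column keeps enough information to recompute it at any higher level. A minimal sketch with hypothetical names (sales_summary_district_month as built earlier, store_dim supplying the district-to-region mapping):

-- The summary table stores SUM(sale_amt) and txn_cnt = COUNT(*), never the
-- average itself; a further roll-up to region can then derive it correctly.
SELECT
    g.region_key,
    SUM(a.sale_amt) / NULLIF(SUM(a.txn_cnt), 0) AS avg_sale_per_txn
FROM sales_summary_district_month a
JOIN (
    -- aggregate dimension data: one row per district with its region
    SELECT DISTINCT district_key, region_key FROM store_dim
) g ON a.district_key = g.district_key
GROUP BY g.region_key;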

Page 15: Designing high performance datawarehouse

Efficient Indexing for Datawarehouse

44% say: Adding Indexes

Page 16: Designing high performance datawarehouse

Dimension table indexing
• Create a non-clustered primary key on the surrogate key of each dimension table.
• A clustered index on the business key should be considered. It enhances query response when the business key is used in the WHERE clause, and helps avoid lock escalation during the ETL process.
• For large type 2 SCDs, create a four-part non-clustered index: business key, record begin date, record end date and surrogate key.
• Create non-clustered indexes on columns in the dimension that will be used for searching, sorting or grouping.
• If there is a hierarchy in a dimension, such as Category - Sub-Category - Product ID, create an index on the hierarchy.

Index columns | Index type
EmployeeKey | Non-clustered
EmployeeNationalIDAlternateKey | Clustered
EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey | Non-clustered
FirstName, LastName, DepartmentName | Non-clustered
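A sketch of these guidelines as SQL Server style DDL, using the example columns from the table above; the DimEmployee table name and the index names are illustrative assumptions, not from the deck.

-- Non-clustered primary key on the surrogate key
ALTER TABLE DimEmployee
    ADD CONSTRAINT PK_DimEmployee_EmployeeKey
    PRIMARY KEY NONCLUSTERED (EmployeeKey);

-- Clustered index on the business key (helps WHERE-clause lookups and
-- reduces lock escalation during ETL)
CREATE CLUSTERED INDEX IX_DimEmployee_BusinessKey
    ON DimEmployee (EmployeeNationalIDAlternateKey);

-- Four-part non-clustered index for a large type 2 SCD:
-- business key, record begin date, record end date, surrogate key
CREATE NONCLUSTERED INDEX IX_DimEmployee_SCD2
    ON DimEmployee (EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey);

-- Non-clustered index on columns used for searching, sorting or grouping
CREATE NONCLUSTERED INDEX IX_DimEmployee_Name
    ON DimEmployee (FirstName, LastName, DepartmentName);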

Page 17: Designing high performance datawarehouse

Fact table indexing
• Create a clustered, composite index composed of each of the fact table’s foreign keys.
• Keep the most commonly queried date column as the leftmost column in the index.
• There can be more than one date in the fact table, but there is usually one date that is of most interest to business users. A clustered index on this column has the effect of quickly segmenting the amount of data that must be evaluated for a given query.

Index columns | Index type
OrderDateKey, ProductKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, DueDateKey | Clustered
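A sketch of the same composite clustered index in SQL Server style syntax, with the most commonly queried date key leftmost; the FactInternetSales table name and the index name are assumptions mirroring the example columns above.

-- One clustered, composite index over the fact table's foreign keys;
-- OrderDateKey leads because most queries filter on the order date.
CREATE CLUSTERED INDEX IX_FactSales_OrderDate_Keys
    ON FactInternetSales
       (OrderDateKey, ProductKey, CustomerKey, PromotionKey,
        CurrencyKey, SalesTerritoryKey, DueDateKey);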

Page 18: Designing high performance datawarehouse

Column Oriented databases

Page 19: Designing high performance datawarehouse

Row Store and Column Store

Most queries do not process all the attributes of a particular relation.

Row store:
(+) Easy to add or modify a record
(-) Might read in unnecessary data

Column store:
(+) Only needs to read in the relevant data
(-) Tuple writes require multiple accesses

One can obtain the performance benefits of a column store using a row store by making some changes to the physical structure of the row store:
• Vertical partitioning
• Index-only plans
• Materialized views

Page 20: Designing high performance datawarehouse

Vertical Partitioning

Process:
• Full vertical partitioning of each relation: each column becomes one physical table
• This can be achieved by adding an integer position column to every table; adding an integer position is better than adding the primary key
• Join on position for multi-column fetches
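A minimal sketch of this idea with a hypothetical two-column relation: each column gets its own physical table keyed by an integer position, and multi-column reads join the tables back together on that position.

-- One physical table per column, each carrying the same integer position
CREATE TABLE sales_col_store_key (pos INT, store_key INT);
CREATE TABLE sales_col_sale_amt  (pos INT, sale_amt  DECIMAL(12,2));

-- A multi-column fetch reassembles the tuples by joining on position
SELECT k.store_key, a.sale_amt
FROM sales_col_store_key k
JOIN sales_col_sale_amt  a ON a.pos = k.pos
WHERE k.store_key = 42;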

Page 21: Designing high performance datawarehouse

Index-only plans

• Process:
  – Add a B+Tree index for every Table.column
  – Plans never access the actual tuples on disk
  – Headers are not stored, so per-tuple overhead is less
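A rough sketch of the same idea in SQL, with hypothetical table and index names: when an index’s keys cover every column a query touches, the plan can be answered from the index leaf pages alone and never reads the underlying rows.

-- Index on the queried columns; the composite index below covers the
-- example query, so the optimizer can choose an index-only (covering) plan.
CREATE INDEX ix_sales_fact_date_amt ON sales_fact (date_key, sale_amt);

-- Answered entirely from the index, without touching the base tuples
SELECT date_key, SUM(sale_amt) AS sale_amt
FROM sales_fact
GROUP BY date_key;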

Page 22: Designing high performance datawarehouse

Using Hadoop for Datawarehouse

Page 23: Designing high performance datawarehouse

Hadoop ecosystem

• Distributed Storage (HDFS)
• Distributed Processing (MapReduce)
• Metadata Management (HCatalog)
• Query (Hive)
• Scripting (Pig)
• Workflow & Scheduling (Oozie)
• Non-Relational Database (HBase)
• Data Extraction & Loading (HCatalog APIs, WebHDFS, Talend Open Studio for Big Data, Sqoop)
• Management & Monitoring (Ambari, ZooKeeper)

• Ecosystem of open source projects
• Hosted by the Apache Foundation
• Google developed and shared the underlying concepts
• A distributed file system that has the ability to scale out

Page 24: Designing high performance datawarehouse

Promising uses of Hadoop in the DW context:

• Data staging – Hadoop allows organizations to deploy an extremely scalable and economical ETL environment
• Data archiving – Hadoop’s scalability and low cost enable organizations to keep all data forever in a readily accessible online environment
• Schema flexibility – Hadoop can quickly and easily ingest any data format
• Processing flexibility – Hadoop enables the growing practice of “late binding”: instead of transforming data as it is ingested by Hadoop, structure is applied at runtime
• Distributed DW architecture – Offload workloads for big data and advanced analytics to HDFS, discovery platforms and MapReduce
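A small HiveQL sketch of the “late binding” pattern (paths, table and column names are hypothetical): files are landed in HDFS untouched, and structure is applied only when an external table is declared over them and read.

-- Schema on read: no transformation at ingest time; the table definition
-- simply projects a structure onto files already sitting in HDFS.
CREATE EXTERNAL TABLE raw_clickstream (
    event_ts  STRING,
    user_id   STRING,
    url       STRING,
    payload   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/staging/clickstream/';

-- The same raw files can later be re-projected with a different schema,
-- or aggregated and pushed onward to the warehouse.
SELECT user_id, COUNT(*) AS events
FROM raw_clickstream
GROUP BY user_id;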

Page 25: Designing high performance datawarehouse

What led to Datawarehouse at Facebook

The problem – data, data and more data:
• 200 GB per day in March 2008
• 2+ TB (compressed) per day

The Hadoop experiment:
• Superior in availability, scalability and manageability compared to commercial databases
• Uses the Hadoop File System (HDFS)

Challenges with Hadoop – programmability and metadata:
• MapReduce is hard to program
• Need to publish data in well-known schemas

The solution: HIVE

What is Hive?
• A system for managing and querying structured data, built on top of Hadoop
• Uses MapReduce for execution
• Uses HDFS for storage

Key building principles:
• SQL on structured data as a familiar data warehousing tool
• Pluggable map/reduce scripts in the language of your choice
• Rich data types
• Performance

Tables:
• Each table has a corresponding directory in HDFS
• Each table points to existing data directories in HDFS
• Data is split based on the hash of a column – mainly for parallelism
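A short HiveQL sketch of the table concepts on this slide (names are hypothetical): the table maps to a directory in HDFS, a partition column keeps each day’s data in its own subdirectory, and bucketing splits data on a hash of a column mainly for parallelism; queries compile down to MapReduce jobs.

CREATE TABLE page_views (
    view_ts  STRING,
    user_id  BIGINT,
    url      STRING
)
PARTITIONED BY (dt STRING)                 -- one HDFS subdirectory per day
CLUSTERED BY (user_id) INTO 32 BUCKETS;    -- hash split for parallelism

-- Executed as MapReduce over the files stored under the table's directory
SELECT dt, COUNT(DISTINCT user_id) AS users
FROM page_views
GROUP BY dt;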

Page 26: Designing high performance datawarehouse

Analytical platforms

Page 27: Designing high performance datawarehouse

Analytical platforms overview: 1010data, Aster Data (Teradata), Calpont, DATAllegro (Microsoft), Exasol, Greenplum (EMC), IBM Smart Analytics, Infobright, Kognitio, Netezza (IBM), Oracle Exadata, ParAccel, Pervasive, Sand Technology, SAP HANA, Sybase IQ (SAP), Teradata, Vertica (HP)

Purpose-built database management systems designed explicitly for query processing and analysis, providing dramatically higher price/performance and availability compared to general purpose solutions.

Deployment options:
• Software only (ParAccel, Vertica)
• Appliance (SAP, Exadata, Netezza)
• Hosted (1010data, Kognitio)

• Kelley Blue Book – consolidates millions of auto transactions each week to calculate car valuations
• AT&T Mobility – tracks purchasing patterns for 80M customers daily to optimize targeted marketing

Page 28: Designing high performance datawarehouse

Which platform do you choose?

Structured Semi-Structured Unstructured

Hadoop

Analytic Database

General Purpose RDBMS

Page 29: Designing high performance datawarehouse

Thank You

Please send your feedback, and your corporate training / consulting services requirements on BI, to [email protected]