Data Warehouse Modernization Benjamin Smith Vertica Product Marketing [email protected]

Post on 15-Mar-2018

Page 1: Data Warehouse Modernization - 1105 Mediadownload.1105media.com/tdwi/Remote-assets/Events/2017/Vertica...with no name node or ... PARTITIONED BY (page_view_dt date) STORED AS ORC;

Page 2

Agenda

Vertica Overview

Advanced Analytics

Hadoop Integration

Case Studies

Page 3

Vertica Overview

Page 4

Foundations of Vertica

Columnar storage: Speeds query time by reading only necessary data

Compression: Lowers costly I/O to boost overall performance

MPP scale-out: Provides high scalability on clusters with no name node or other single point of failure

Distributed query: Any node can initiate the queries and use other nodes for work. No single point of failure

Projections: Combine high availability with special optimizations for query performance

[Diagram labels: Memory, CPU, Disk]
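The columnar-storage point above (reading only necessary data) can be illustrated with a small sketch. This is not Vertica code: the sample rows, the names `row_store` and `column_store`, and the count of values read as a stand-in for I/O are all assumptions made for illustration.

```python
# Illustrative sketch only: toy in-memory layouts, not Vertica's storage format.
# The sample data and all names below are hypothetical.
rows = [
    {"customer_id": 1, "region": "EU", "amount": 10.0},
    {"customer_id": 2, "region": "US", "amount": 25.0},
    {"customer_id": 3, "region": "EU", "amount": 5.0},
]

# Row layout: each record is stored together, so a scan touches every field.
row_store = [tuple(r.values()) for r in rows]

# Column layout: each column is stored as its own list.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def sum_amount_row_store(store):
    """Sum the 'amount' field; every value of every record is read on the way."""
    values_read = 0
    total = 0.0
    for record in store:
        values_read += len(record)  # the whole record is fetched
        total += record[2]          # 'amount' is the third field
    return total, values_read


def sum_amount_column_store(store):
    """Sum the 'amount' column; only that one column is read."""
    column = store["amount"]
    return sum(column), len(column)


print(sum_amount_row_store(row_store))
print(sum_amount_column_store(column_store))
```

Both functions return the same total, but the column layout reads one value per row instead of three. The gap widens as tables get wider and queries touch fewer columns.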

Page 5

HPE Vertica analytics platform

Fast: Boost performance by 500% or more

Scalable: Handles huge workloads at high speeds

Standard: No need to learn new languages or add complexity

Costs: Significantly lower cost over legacy platforms

Page 6

HPE Vertica portfolio

The broadest range of deployment and consumption models

Vertica unified platform
– Cloud
– On-Premise
– On Hadoop

HPE Vertica Enterprise
– Columnar storage and advanced compression
– Maximum performance and scalability
– Flex Tables for schema-on-read

HPE Vertica for SQL on Apache® Hadoop
– Native support for ORC, Parquet, more
– Support for industry-leading distributions
– No helper node or single point of failure

HPE Vertica in the Clouds
– Deploy quickly on the choice of cloud platform
– Support for AWS, Azure, VMware
– Flexible, enterprise-class cloud deployment options

Page 7

Vertica Partner Ecosystem

Enabled in large part by Vertica’s strong support for the SQL standard

Data transformation

Advanced analytics

Cloud

Platform

BI / Visualization

HPE Secure Data

Page 8

Advanced Analytics

Page 9

Advanced In-database Analytics

SQL ‘99
– Aggregate – Analytical – Window functions – Graph – Monte Carlo – Statistical – Geospatial
Allows for: – Standard functionality that performs at scale

SQL Extensions
– Pattern matching – Event series joins – Time series – Event-based windows
Allows for: – Sessionization – Conversion analysis – Fraud detection – Fast Aggregates (LAP)

In-database Analytics
– Regression testing – K-means – Statistical modeling – Classification algorithms – Page rank – Text mining – Geospatial
Allows for: – Statistical modeling – Cluster analysis – Predictive analytics

SDKs
Analytics – Java – C++ – R
Connection – ODBC/JDBC – HIVE – Hadoop – Flex zone
Allows for: – Machine learning – Custom data mining – Specialized parsers

Page 10

Applied Machine Learning at Scale

Current Barriers to Machine Learning and Predictive Analytics

Added Cost
– Additional hardware required for building predictive models

Requires Down Sampling
– Cannot process large data sets due to memory and computational limitations, resulting in inaccurate predictions

Slower Time to Development
– Higher turnaround times for model building/scoring and need for moving large volumes of data between systems

Slower Time to Deployment
– Inability to quickly deploy predictive models into production

Page 11

Building Machine Learning into the Core of Vertica

- Machine Learning algorithms, such as k-means and regression, built into the core of Vertica

- Advanced predictive modeling runs within the database, eliminating all data duplication typically required by alternative vendor offerings

- Traditional approaches can’t handle many data points, forcing data scientists to “down-sample”, which leads to less accurate predictions

- A single system for SQL analytics and Machine Learning

[Diagram: Node 1, Node 2, … Node n] Machine learning functions run in parallel across hundreds of nodes in a Vertica cluster

Page 12

In-Database ML Algorithms

Table columns: Algorithm, Model Training, Prediction, Evaluation

Algorithms
– Linear Regression
– Logistic Regression
– K-means
– Naïve Bayes
– Support Vector Machine (SVM)

Model Management
– Summarize models
– Rename models
– Delete models

Data Preparation
– Data Normalization

Page 13

Hadoop Integration

Page 14

HPE Vertica for SQL on Apache Hadoop®

- Same Vertica MPP Columnar Architecture

- Base ANSI SQL

- Co-Located with Hadoop

- Data Query Across Parquet, ORC, JSON, and many other formats

- Optimized stack for query speed

- Parallelized query planning and execution

- Vectorized data loading, asynchronous data access

- Hadoop Agnostic

[Diagram labels: Analytical Applications; SQL, Python, Java, R; Vertica Core Engine; ROS, JSON Storage; ORC, Parquet, AVRO Storage]

Page 15

Vertica’s Unique Value to Expand the Data Warehouse

Hadoop Data Lake: Customer information in Hadoop

CREATE TABLE customer_visits (
    customer_id bigint,
    visit_num int)
PARTITIONED BY (page_view_dt date)
STORED AS ORC;

Vertica Big Data Warehouse: Customer information in Data Warehouse

Deeper understanding of customer

SELECT customers.customer_id
FROM orders
RIGHT OUTER JOIN customers
    ON orders.customer_id = customers.customer_id
GROUP BY customers.customer_id
HAVING COUNT(orders.customer_id) = 0;
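The SELECT statement above returns customers that have no matching rows in `orders`. A minimal sketch of that logic follows, using Python's built-in `sqlite3` module as a stand-in engine. It does not run on Vertica or Hadoop, and the sample rows and the `order_id` column are hypothetical. The query is written as a LEFT OUTER JOIN starting from `customers`, which is equivalent to the slide's RIGHT OUTER JOIN and also works on older SQLite versions that lack RIGHT JOIN.

```python
import sqlite3

# In-memory stand-in database; table contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (100, 1), (101, 1), (102, 3);
""")

# Same logic as the slide's query: keep every customer, then keep only the
# groups where no order row matched (COUNT ignores the NULLs from the join).
query = """
    SELECT customers.customer_id
    FROM customers
    LEFT OUTER JOIN orders
        ON orders.customer_id = customers.customer_id
    GROUP BY customers.customer_id
    HAVING COUNT(orders.customer_id) = 0
    ORDER BY customers.customer_id;
"""

customers_without_orders = [row[0] for row in conn.execute(query)]
print(customers_without_orders)  # prints [2]
```

Customer 2 is the only one without orders in the sample data, so it is the only row returned.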

• Leveraging Web Logs to gain customer insight

• Sensor and IOT data for pre-emptive service

• Marketing Programs Tracking

HPE Vertica Engine: Querying data that sits BOTH in the data warehouse and Hadoop is our unique value. Most solutions require that you move the data.

Page 16

Proof is in the pudding

– 3 TB of test data

– HPE ProLiant DL380 Gen9

– Five nodes

– ROS, Parquet and ORC formats

– 99 TPC-DS queries

Page 17

Other SQL on Hadoop can’t run all the queries

Passing and failing TPC-DS queries (out of 99)

Vertica Enterprise: 99 pass, 0 fail

Vertica SQL on Hadoop: 99 pass, 0 fail

Impala: 80 pass, 19 fail

Hive on Tez: 59 pass, 40 fail

Hive on Spark: 29 pass, 70 fail

Page 18

Boosting Cloudera with Vertica

• Vertica Enterprise completes the benchmark in 30% of the time of Impala

• Vertica for SQL on Hadoop completes in 56% of the time of Impala

• Impala could not complete 19 of the 99 queries at all. Those queries were not compared.

Seconds to complete benchmark (of runnable queries)

Impala: about 11 hours 40 mins

Vertica SQL on Hadoop on Parquet: about 6 ½ hours

Vertica Enterprise: about 3 ½ hours
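The percentages quoted for the Impala comparison can be checked against the approximate run times on the chart. The sketch below is only a sanity check on the slide's rounded figures. The helper names are hypothetical, and "6 ½ hours" and "3 ½ hours" are taken as 6 h 30 min and 3 h 30 min.

```python
def to_seconds(hours, minutes=0):
    """Convert an hours/minutes duration to seconds."""
    return hours * 3600 + minutes * 60


def percent_of(part, whole):
    """Return part as a whole-number percentage of whole."""
    return round(100 * part / whole)


# Approximate run times read from the slide (runnable queries only).
impala = to_seconds(11, 40)                # about 11 hours 40 mins
vertica_sql_on_hadoop = to_seconds(6, 30)  # about 6 1/2 hours
vertica_enterprise = to_seconds(3, 30)     # about 3 1/2 hours

print(percent_of(vertica_enterprise, impala))     # prints 30
print(percent_of(vertica_sql_on_hadoop, impala))  # prints 56
```

Both results match the 30% and 56% figures stated in the bullets.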

Page 19

Boosting Apache/Hortonworks with Vertica

• Vertica Enterprise completes the benchmark in 7% of the time of Hive on Tez

• Vertica SQL on Hadoop using ORC files completes in 12% of the time of Hive on Tez

• Hive on Tez could not complete 40 of the 99 queries at all. Those queries were not compared.

Seconds to complete benchmarks (of runnable queries)

Hive on Tez: about 21 hours 15 mins

Vertica SQL on Hadoop ORC: about 2 ¼ hours

Vertica EE: about 1 ½ hours

Page 20

Case Studies

Page 21

Delivering predictive network analytics for Telecommunications companies

The Challenge

• Data storage requirements increasing exponentially, but customers expecting analytic response times in seconds, not minutes or hours

• With their previous Oracle system, enlarging storage was complicated and time consuming.

The Solution

• Vertica has provided Anritsu with the technology necessary to implement predictive analytics solutions that have only been theoretical until now

• Realized rapid ROI after implementing Vertica in place of a legacy Oracle solution: 351% ROI with a payback of just 4 months

ROI: 351%

Payback: 4 months

Annual Benefit: $3+ million

Page 22

Using customer analytics to eliminate level 1 and level 2 support

The Challenge

• Perform analytics on over 1.5 petabytes of customer product and performance metadata to fine-tune and continue leading-edge product development and evolution

The Solution

• Leverage Vertica for operational analytics to engage in ongoing communications with customers about storage environments and optimizations

• Use analytics to understand distribution of customers' workloads and how customers access storage, which helps it design storage solutions that match its customers' use patterns

ROI: 287%

Payback: 6 months

Annual Benefit: $13.6 million

Page 23

Vertica “Hot Data” Promotion

Learn more at www.vertica.com/hotdata

For a limited time, get free nodes of Vertica for SQL on Hadoop when you buy Vertica Premium Edition. This is a great way to perform data discovery and ad-hoc analysis on all of your data!

Page 24

Thank You [email protected]
