
© Hortonworks Inc. 2014

Discover HDP 2.2: Even Faster SQL Queries with Apache Hive & Stinger.next

Hortonworks. We do Hadoop.


Speakers

Justin Sears

Hortonworks Product Marketing Manager

Alan Gates

Hortonworks Co-Founder and Apache Hive Committer & PMC Member

Raj Bains

Hortonworks Senior Manager of Product Management for Apache Hive


Agenda

•  Introduction to Stinger.next

•  New Innovation in Apache Hive 0.14

§  SQL: Transactions with ACID semantics

§  Speed: Cost based optimizer for star and bushy joins

§  Scale: Dynamic query optimizations

•  The Road Ahead for Stinger.next

•  Q & A

We’ll move quickly:

•  Attendee phone lines are muted

•  Text any questions to Raj Bains using Webex chat

•  Questions answered at the end

•  Unanswered questions and answers in upcoming blog post


Big Data, Hadoop & Data Center Re-platforming

Business Drivers

•  From reactive analytics to proactive interactions

•  Insights that drive competitive advantage & optimal returns

Financial Drivers

•  Cost of data systems, as % of IT spend, continues to grow

•  Cost advantages of commodity hardware & open source software

Technical Drivers

•  Data is growing exponentially & existing systems overwhelmed

•  Predominantly driven by NEW types of data that can inform analytics

There is an inequitable balance between vendor and customer in the market


New Types of Data (Hadoop Value)

•  Clickstream: Capture and analyze website visitors’ data trails and optimize your website
•  Sensors: Discover patterns in data streaming automatically from remote sensors and machines
•  Server Logs: Research logs to diagnose process failures and prevent security breaches
•  Sentiment: Understand how your customers feel about your brand and products, right now
•  Geographic: Analyze location-based data to manage operations where they occur
•  Unstructured: Understand patterns in files across millions of web pages, emails, and documents


A Shift from Reactive to Proactive Interactions

HDP and Hadoop allow organizations to use data to shift interactions from reactive (post transaction) to proactive (pre decision):

•  From static branding …to real-time personalization
•  From break then fix …to repair before break
•  From mass treatment …to designer medicine
•  From educated investing …to automated algorithms
•  From mass branding …to 1x1 targeting

The slide illustrates these as shifts in Advertising, Financial Services, Healthcare, Retail, and Telco.


Enterprise Goals for the Modern Data Architecture

•  Consolidate siloed data sets, structured and unstructured

•  Central data set on a single cluster

•  Multiple workloads across batch, interactive and real time

•  Central services for security, governance and operation

•  Preserve existing investment in current tools and platforms

•  Single view of the customer, product, supply chain

[Diagram: Modern Data Architecture]

•  APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
•  DATA SYSTEM: RDBMS, EDW and MPP alongside Hadoop, with YARN (Data Operating System) running Interactive, Real-Time and Batch workloads on HDFS (Hadoop Distributed File System)
•  SOURCES: EXISTING Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured


YARN Transformed Hadoop & Opened a New Era

YARN The Architectural Center of Hadoop

•  Common data platform, many applications

•  Support multi-tenant access & processing

•  Batch, interactive & real-time use cases

[Diagram: YARN: Data Operating System (Cluster Resource Management), providing BATCH, INTERACTIVE & REAL-TIME DATA ACCESS]

•  Script: Pig (on Tez)
•  SQL: Hive (on Tez)
•  Java / Scala: Cascading (on Tez)
•  Stream: Storm
•  Search: Solr
•  NoSQL: HBase, Accumulo (on Slider)
•  In-Memory: Spark
•  Others: ISV Engines


YARN Extends Hadoop to Other Data Center Leaders

YARN The Architectural Center of Hadoop

•  Common data platform, many applications

•  Support multi-tenant access & processing

•  Batch, interactive & real-time use cases

•  Supports 3rd-party ISV tools (e.g., SAS, Syncsort, Actian)

YARN Ready Applications: Facilitates ongoing innovation and enterprise adoption via an ecosystem of new and existing “YARN Ready” solutions




Enterprise Hadoop: Central Set of Services



Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:

•  Governance

•  Operations

•  Security

Everything that plugs into Hadoop inherits these services

GOVERNANCE: Load data and manage according to policy

OPERATIONS: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)

SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection

[Diagram: data access engines that inherit these services: Script (Pig on Tez), SQL (Hive on Tez), Java / Scala (Cascading on Tez), Stream (Storm), Search (Solr), NoSQL (HBase and Accumulo on Slider), In-Memory (Spark), Others (ISV Engines)]


Hortonworks Development Investment for the Enterprise

Vertical Integration with YARN and HDFS

•  Ensure engines can run reliably and respectfully in a YARN based cluster
•  Implement features throughout the stack to accommodate


Hortonworks Development Investment for the Enterprise

Horizontal Integration for Enterprise Services

•  Ensure consistent enterprise services are applied across the entire Hadoop stack
•  Integrate with and extend existing data center solutions for these key requirements


Hortonworks Data Platform 2.2

HDP Delivers Enterprise Hadoop


[Diagram: HDP 2.2 components]

•  BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (on YARN): Script (Pig on Tez), SQL (Hive on Tez), Java / Scala (Cascading on Tez), Stream (Storm), Search (Solr), NoSQL (HBase and Accumulo on Slider), In-Memory (Spark), Others (ISV Engines)
•  GOVERNANCE (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
•  SECURITY (Authentication, Authorization, Audit, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox
•  OPERATIONS: Ambari, Zookeeper; Scheduling: Oozie
•  Deployment Choice: Linux, Windows, On-Premises, Cloud

YARN is the architectural center of HDP

•  Common data set across all applications

•  Batch, interactive & real-time workloads

•  Multi-tenant access & processing

Provides comprehensive enterprise capabilities

•  Governance

•  Security

•  Operations

Enables broad ecosystem adoption

•  ISVs can plug directly into Hadoop

The widest range of deployment options

•  Linux & Windows

•  On premises & cloud





Introduction to Stinger.next


Stinger.next – Enterprise SQL at Hadoop Scale

Stinger (Hive 0.13, Tez, ORC File)

Scale to Petabytes

Batch to Interactive Queries

Read-Only Data

Substantial SQL Support

Single Tool for Multiple SQL workloads – Interactive, Reporting and ETL

MapReduce, Tez Engines

Stinger.next

Scale to Petabytes

Sub-Second Queries

Modify Data with Transactions

Comprehensive SQL:2011 Analytics

Single Tool for Multiple SQL workloads – Interactive, Reporting, ETL, ML

MapReduce, Tez, Spark Engines


SQL in Hive 0.14: Transactions with ACID Semantics


Transaction Use Cases

Reporting with Analytics (YES)
•  Reporting on data with occasional updates
•  Corrections to the fact tables, evolving dimension tables
•  Low concurrency updates, low TPS

Operational Reporting (YES, next)
•  High throughput ingest from operational (OLTP) database
•  Periodic inserts every 5-30 minutes
•  Requires tool support and changes in our Transactions

Operational (OLTP) Database (NO)
•  Small Transactions, each doing single line inserts
•  High Concurrency - Hundreds to thousands of connections

[Diagrams, one per use case: Analytics Modifications into Hive; OLTP Replication into Hive; High Concurrency OLTP against Hive]


Deep Dive: Transactions

Transaction Support in Hive with ACID semantics
•  Hive native support for INSERT, UPDATE, DELETE.
•  Split Into Phases:
§  Phase 1: Hive Streaming Ingest (append) [Done]
§  Phase 2: INSERT / UPDATE / DELETE Support [HDP 2.2]
§  Phase 3: BEGIN / COMMIT / ROLLBACK Txn [Next]

[Diagram: tasks reading a Read-Optimized ORCFile, a Delta File, and the Merged Read-Optimized ORCFile]

1. Original File: Task reads the latest ORCFile
2. Edits Made: Task reads the ORCFile and merges the delta file with the edits
3. Edits Merged: Task reads the updated ORCFile

Hive ACID Compactor periodically merges the delta files in the background.
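The three read states above (original file, edits made, edits merged) can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the `Delta` class, `merge_on_read` and `compact` are hypothetical names, rows are plain dictionary entries, and nothing here reflects Hive's actual ORC base and delta file layout or the real compactor.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Delta:
    """Hypothetical delta file: the edits written by one transaction."""
    txn_id: int
    # row id -> new value; None marks a delete
    edits: Dict[int, Optional[str]]


def merge_on_read(base: Dict[int, str], deltas: List[Delta]) -> Dict[int, str]:
    """Return the rows a task sees: the base with deltas applied in txn order."""
    view = dict(base)
    for delta in sorted(deltas, key=lambda d: d.txn_id):
        for row_id, value in delta.edits.items():
            if value is None:
                view.pop(row_id, None)  # delete
            else:
                view[row_id] = value    # insert or update
    return view


def compact(base: Dict[int, str], deltas: List[Delta]) -> Tuple[Dict[int, str], List[Delta]]:
    """Fold every delta into a new base so later reads have nothing to merge."""
    return merge_on_read(base, deltas), []


# 1. Original file: the task reads the base as-is.
base = {1: "alice", 2: "bob", 3: "carol"}
deltas: List[Delta] = []
original_view = merge_on_read(base, deltas)

# 2. Edits made: the task merges the base with the delta files.
deltas.append(Delta(txn_id=1, edits={2: "bobby", 4: "dave"}))  # update + insert
deltas.append(Delta(txn_id=2, edits={3: None}))                 # delete
edited_view = merge_on_read(base, deltas)

# 3. Edits merged: after compaction the task reads only the new base.
new_base, remaining = compact(base, deltas)
compacted_view = merge_on_read(new_base, remaining)

print(edited_view)
print(compacted_view == edited_view)
```

The point of the sketch is that readers see the same rows before and after compaction; compaction only moves the merge cost from read time to a background step.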


Speed in Hive 0.14: Cost Based Optimizer


TPC-DS Query 17

SELECT i_item_id,
       i_item_desc,
       s_state,
       Count(ss_quantity) AS store_sales_quantitycount,
       Avg(ss_quantity) AS store_sales_quantityave,
       Stddev_samp(ss_quantity) AS store_sales_quantitystdev,
       Stddev_samp(ss_quantity) / Avg(ss_quantity) AS store_sales_quantitycov,
       Count(sr_return_quantity) as_store_returns_quantitycount,
       Avg(sr_return_quantity) as_store_returns_quantityave,
       Stddev_samp(sr_return_quantity) as_store_returns_quantitystdev,
       Stddev_samp(sr_return_quantity) / Avg(sr_return_quantity) AS store_returns_quantitycov,
       Count(cs_quantity) AS catalog_sales_quantitycount,
       Avg(cs_quantity) AS catalog_sales_quantityave,
       Stddev_samp(cs_quantity) / Avg(cs_quantity) AS catalog_sales_quantitystdev,
       Stddev_samp(cs_quantity) / Avg(cs_quantity) AS catalog_sales_quantitycov
FROM   store_sales,
       store_returns,
       catalog_sales,
       date_dim d1,
       date_dim d2,
       date_dim d3,
       store,
       item
WHERE  d1.d_quarter_name = '2000Q1'
       AND d1.d_date_sk = store_sales.ss_sold_date_sk
       AND ss_sold_date BETWEEN '2000-01-01' AND '2000-03-31'
       AND item.i_item_sk = store_sales.ss_item_sk
       AND store.s_store_sk = store_sales.ss_store_sk
       AND store_sales.ss_customer_sk = store_returns.sr_customer_sk
       AND store_sales.ss_item_sk = store_returns.sr_item_sk
       AND store_sales.ss_ticket_number = store_returns.sr_ticket_number
       AND store_returns.sr_returned_date_sk = d2.d_date_sk
       AND d2.d_quarter_name IN ( '2000Q1', '2000Q2', '2000Q3' )
       AND sr_returned_date BETWEEN '2000-01-01' AND '2000-09-01'
       AND store_returns.sr_customer_sk = catalog_sales.cs_bill_customer_sk
       AND store_returns.sr_item_sk = catalog_sales.cs_item_sk
       AND catalog_sales.cs_sold_date_sk = d3.d_date_sk
       AND d3.d_quarter_name IN ( '2000Q1', '2000Q2', '2000Q3' )
       AND cs_sold_date BETWEEN '2000-01-01' AND '2000-09-31'
GROUP  BY i_item_id, i_item_desc, s_state
ORDER  BY i_item_id, i_item_desc, s_state
LIMIT  100;


CBO on Selected Queries – 17

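[Diagram: join graph for Query 17]

•  Fact tables: store_sales, store_returns, catalog_sales (each with a date filter)
•  Dimension tables: item, store, date_dim d1, date_dim d2, date_dim d3 (each date_dim with a quarter filter)
•  store_sales joins store_returns on customer_sk, item_sk, ticket_number
•  store_returns joins catalog_sales on customer_sk, item_sk
•  Each fact table joins its date_dim on date_sk
•  store_sales joins item on item_sk and store on store_sk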


OLD: Left Deep Plan

[Diagram: query plan]

•  Map 2: Table_scan catalog_sales
•  Map 9: Table_scan store_sales
•  Map 12: Table_scan store_returns
•  Map 6: Table_scan d2, filter
•  Map 8: Table_scan store
•  Map 11: Table_scan item
•  Reducer 10: Merge join 12, 9
•  Reducer 3: Merge join 2 & 10, Map join 1, Map join 6, Map join 7, Map join 8 (store), Map join 11 (item), Filter, Group By, Reduce
•  Reducer 4: Group_By, Reduce
•  Reducer 5: Limit

Large Fact tables joined together without filters


NEW: Complex Bushy Plan

[Diagram: query plan]

•  Map 9: Table_scan d1, filter (the diagram shows two further filtered scans whose labels did not survive)
•  Map 3: store_sales, Map join
•  Map 8: store_returns, Map join
•  Map 11: catalog_sales, Map join
•  Map 10: Table_scan store
•  Map 12: Table_scan item
•  Reducer 4: Merge join 3 & 8, Map join store, Map join item, Reduce
•  Reducer 5: Merge_Join, Group_By, Reduce
•  Reducer 6: Group by, Reduce
•  Reducer 7: Limit

All 3 Large Fact tables joined with date dimension limiting data to few quarters
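Why the new plan produces less intermediate data can be seen in a toy comparison. The sketch below is not Hive's optimizer: the tables, cardinalities and the two plan functions (`facts_first`, `filters_first`) are invented for illustration, and only two fact tables are modeled. It joins fact tables before applying the date filter in one plan, and applies the filtered date keys to each fact table first in the other, then counts intermediate rows.

```python
# Hypothetical toy data: each fact row is (date_sk, join_key).
DAYS = 40
store_sales = [(d, k) for d in range(DAYS) for k in range(5)]
store_returns = [(d, k) for d in range(DAYS) for k in range(5)]
date_dim = [(d, "2000Q1" if d < 4 else "other") for d in range(DAYS)]

# Keys that survive the dimension filter (quarter = 2000Q1).
wanted_dates = {d for d, quarter in date_dim if quarter == "2000Q1"}


def hash_join(left, right, left_key, right_key):
    """Plain in-memory hash join that returns concatenated tuples."""
    index = {}
    for row in right:
        index.setdefault(right_key(row), []).append(row)
    return [l + r for l in left for r in index.get(left_key(l), [])]


def facts_first():
    """Old-style plan: join the fact tables, apply the date filters afterwards."""
    joined = hash_join(store_sales, store_returns, lambda r: r[1], lambda r: r[1])
    result = [r for r in joined if r[0] in wanted_dates and r[2] in wanted_dates]
    return len(joined), sorted(result)


def filters_first():
    """New-style plan: shrink each fact table with the date keys, then join."""
    ss = [r for r in store_sales if r[0] in wanted_dates]
    sr = [r for r in store_returns if r[0] in wanted_dates]
    joined = hash_join(ss, sr, lambda r: r[1], lambda r: r[1])
    return len(ss) + len(sr) + len(joined), sorted(joined)


old_rows, old_result = facts_first()
new_rows, new_result = filters_first()
print("intermediate rows, facts first:  ", old_rows)
print("intermediate rows, filters first:", new_rows)
print("same answer:", old_result == new_result)
```

Both plans return the same rows; only the amount of data carried between steps differs, which is the effect the slide attributes to the bushy plan.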


Performance Improvement – Query 17

Scale = 30TB, Input records ~186mil

CBO   Elapsed Time (sec)   Elapsed Time   Intermediate data (GB)   Output and Intermediate Records
OFF   10,683               ~3 hrs         5,017                    135,647,792,123
ON    1,284                ~20 mins       275                      8,543,232,360
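The improvement ratios implied by the Query 17 figures can be computed directly. The numbers below are copied from the slide; the dictionary keys are just labels chosen for this sketch.

```python
# Figures copied from the Query 17 slide (30TB scale).
cbo_off = {"elapsed_sec": 10_683, "intermediate_gb": 5_017, "records": 135_647_792_123}
cbo_on = {"elapsed_sec": 1_284, "intermediate_gb": 275, "records": 8_543_232_360}

# Ratio of the CBO-off figure to the CBO-on figure for each metric.
ratios = {name: cbo_off[name] / cbo_on[name] for name in cbo_off}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x lower with CBO on")
```

With CBO on, elapsed time drops by roughly 8x, intermediate data by roughly 18x, and output and intermediate records by roughly 16x.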


Scale in Hive 0.14: Dynamic Query Optimization


Auto Reducer Parallelism

Use dynamic data volume during execution, rather than estimates from query compilation, to determine the number of reducers.

Leads to:
•  faster query execution
•  better resource utilization

[Diagram: over time, the App Master (Vertex Manager, Vertex State) coordinates tasks for a single map vertex and tasks for a single reduce vertex across machines]

1. Data size statistics
2. Set parallelism
3. Re-route
4. Cancel task
5. Tasks Completed
6. Start Tasks
7. Start


Auto Reducer Parallelism

use tpcds_bin_partitioned_orc_30000;
set hive.tez.auto.reducer.parallelism=true;
set hive.tez.min.partition.factor=0.125;

SELECT ss_promo_sk,
       Sum(ss_sales_price),
       Count(*)
FROM   store_sales
WHERE  ss_sold_date < '1998-03-01'
GROUP  BY ss_promo_sk
ORDER  BY 2 DESC
LIMIT  10;
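The idea of sizing the reduce stage from observed data rather than compile-time estimates can be sketched as a small function. This is a simplified model and an assumption, not the actual Tez vertex manager logic: `choose_reducers` and its parameters are hypothetical, with `min_partition_factor` playing the role that `hive.tez.min.partition.factor` suggests (a lower bound expressed as a fraction of the compile-time reducer count).

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def choose_reducers(observed_bytes, estimated_reducers, bytes_per_reducer, min_partition_factor):
    """Pick a reducer count from the data the map tasks actually produced.

    Simplified model (assumption): size the reduce stage by observed bytes,
    never dropping below estimated_reducers * min_partition_factor and never
    exceeding the compile-time estimate.
    """
    floor = max(1, math.ceil(estimated_reducers * min_partition_factor))
    wanted = max(1, math.ceil(observed_bytes / bytes_per_reducer))
    return min(estimated_reducers, max(floor, wanted))


# Compile time guessed 1000 reducers; the filter turned out to be selective.
small = choose_reducers(10 * GB, 1000, 256 * MB, 0.125)
medium = choose_reducers(100 * GB, 1000, 256 * MB, 0.125)
large = choose_reducers(1024 * GB, 1000, 256 * MB, 0.125)
print(small, medium, large)
```

In this model a selective filter lets the reduce stage shrink to the configured floor, while a large shuffle stays capped at the compile-time estimate.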


Dynamic Partition Pruning

Example: Join of
•  a large Fact table with multiple partitions (store_sales)
•  with a dimension table that has a filter (date_dim d1)

Join condition: ss_sold_date_sk = date_sk

Table Definition
create table store_sales (...) partitioned by (ss_sold_date_sk int) stored as orc;

[Diagram: store_sales partitions d1, d2, d3, d4, … joined to the filtered date_dim]

The ss_sold_date_sk partitions that can be pruned away at join time are not known until the filter is applied at runtime.

Compile Time Design
•  Insert synthetic conditions for each join representing "x in (keys of other side in join)". Optimizer will push it as far down as possible
•  If the condition hits a table scan and the column involved is a partition column:
§  Setup Operator to send key events to AM
•  else:
§  Remove synthetic predicate
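The runtime effect of the synthetic "x in (keys of other side in join)" condition can be sketched with toy data. Everything below is hypothetical: the rows, the `dimension_keys` and `scan_with_pruning` helpers, and the dictionary standing in for a partitioned table. It shows only the idea of collecting the dimension's surviving join keys and skipping fact partitions outside that set, not how Hive and Tez exchange events.

```python
# Hypothetical toy data; column names mirror the store_sales / date_dim example.
date_dim = [
    {"d_date_sk": 1, "d_moy": 11},
    {"d_date_sk": 2, "d_moy": 12},
    {"d_date_sk": 3, "d_moy": 12},
    {"d_date_sk": 4, "d_moy": 1},
]

# store_sales partitioned by ss_sold_date_sk: partition key -> rows.
store_sales_partitions = {
    1: [{"ss_item_sk": 10, "ss_ext_sales_price": 5.0}],
    2: [{"ss_item_sk": 11, "ss_ext_sales_price": 7.5}],
    3: [{"ss_item_sk": 10, "ss_ext_sales_price": 2.5}],
    4: [{"ss_item_sk": 12, "ss_ext_sales_price": 9.0}],
}


def dimension_keys(rows, predicate):
    """Runtime step: apply the dimension filter and collect surviving join keys."""
    return {row["d_date_sk"] for row in rows if predicate(row)}


def scan_with_pruning(partitions, keys):
    """Read only partitions whose key satisfies the synthetic 'x in (keys)' condition."""
    scanned, rows = [], []
    for part_key, part_rows in partitions.items():
        if part_key in keys:
            scanned.append(part_key)
            rows.extend(part_rows)
    return scanned, rows


keys = dimension_keys(date_dim, lambda row: row["d_moy"] == 12)
scanned, rows = scan_with_pruning(store_sales_partitions, keys)
total = sum(r["ss_ext_sales_price"] for r in rows)
print("partitions scanned:", scanned)
print("sum of ss_ext_sales_price:", total)
```

Only the partitions whose keys pass the dimension filter are read, even though that key set was unknown when the query was compiled.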

[Diagram: App Master (Vertex Manager, Vertex State) and machines]

1. Send events for partition pruning




Dynamic Pruning

TPC-DS Query 3

SELECT dt.d_year,
       item.i_brand_id brand_id,
       item.i_brand brand,
       Sum(ss_ext_sales_price) sum_agg
FROM   date_dim dt,
       store_sales,
       item
WHERE  dt.d_date_sk = store_sales.ss_sold_date_sk
       AND store_sales.ss_item_sk = item.i_item_sk
       AND item.i_manufact_id = 436
       AND dt.d_moy = 12
GROUP  BY dt.d_year, item.i_brand, item.i_brand_id
ORDER  BY dt.d_year, sum_agg DESC, brand_id
LIMIT  100;


Stinger.next: The Road Ahead


Stinger.next - Delivery Themes

Beyond Read-Only (2nd Half 2014)
•  Transactions with ACID allowing insert, update and delete
•  Temporary Tables
•  Cost Based Optimizer optimizes star and bushy join queries

Sub-Second (1st Half 2015)
•  Sub-Second queries with LLAP
•  Hive-Spark Machine Learning integration
•  Operational reporting with Hive Streaming Ingest and Transactions
•  Replication and SQL/CBO improvements

Richer Analytics (2nd Half 2015)
•  Toward SQL:2011 Analytics
•  Materialized Views
•  Cross-Geo Queries
•  Workload Management via YARN and LLAP integration


Q & A


Thank you! Learn more at: hortonworks.com/hadoop/hive/

Register for the remaining 6 Discover HDP 2.2 Webinars

Hortonworks.com/webinars
