socal bigdata day

© Hortonworks Inc. 2011 – 2017. All Rights Reserved

SQL on Hadoop - Batch, Interactive and Beyond
SoCal Big Data Day

John Park
Solution Engineer, Hortonworks
Rm 138-140


Disclaimer

This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.

Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.

This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.

Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.


Presenter: John Park
• Solution Engineer, SoCal
• Data Science ETL, data warehousing, software design, architecture
• Previous – Various Startups, Qlik, DW consultant, NCR
• Current – Helping customers implement and understand open source big data platforms
• Twitter: @jpark328
• Email: [email protected]

Wei Wang (Marketing)



Before We Begin

• We have a raffle
• 2 winners at the end of the presentation
• Prize – Amazon Echo Dot
• Ask questions

Survey Link: https://www.surveymonkey.com/r/940amSQLHadoopBatch




SQL is King

Why?
– Familiarity
  • Primary technical language of the business analyst
– Powerful
  • Maturation of RDBMS, EDW, OLTP
  • ACID Compliant
– Flexible
  • Covers Transactional Processes to Analytics
– Pervasive
  • Emergence of BI tools (Tableau, BOBJ, Cognos)
  • Deep ecosystem of tools






Overview of SQL on Hadoop Solutions

• SparkSQL: Spark's module for working with structured data. Run SQL queries alongside complex analytic algorithms.

• Apache Hive: A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

• Apache Phoenix: High performance relational database layer over HBase for low latency applications.

• Traditional MPP on Hadoop: Many traditionally architected MPP solutions have been ported to Hadoop and some new ones have been developed from scratch.


SQL on Hadoop: Vitals

Project | First GA Release | Lines of Code (June 2015) (*) | Most Typical Use
Apache Hive | April, 2009 (7 Years) | 1 Million | EDW / ETL Offload
SparkSQL | March, 2015 (2 Years) | 56.6k | Exploratory Analytics
Apache Phoenix | March, 2014 (3 Years) | 200k | Low-Latency Dashboards


Apache Hive: Fast Facts

• Most Queries Per Hour: 100,000 Queries Per Hour (Yahoo Japan)
• Analytics Performance: 100 Million rows/s Per Node (with Hive LLAP)
• Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)
• Largest Cluster: 4,500+ Nodes (Yahoo)


Phoenix and HBase: Fast Facts

• Largest Database: 5 Petabytes (Flurry)
• Best Known App: Facebook Messages (Facebook)
• Fastest Ingestion: 10 Million Events/s (Yahoo)
• Biggest SQL App: Real-Time SQL on 140m+ Records (PubMatic)


Apache Hive: Strengths and Cautions

Strengths (+)
• Huge Datasets
• Deep SQL Analytics
• EDW Offload
• BI Integration

Cautions (?)
• Near-Real-Time


SparkSQL: Strengths and Cautions

Strengths (+)
• Language-Integrated Query
• Exploratory Analytics

Cautions (?)
• Large Datasets
• High Concurrency
• EDW Offload


Apache Phoenix: Strengths and Cautions

Strengths (+)
• High Concurrency
• Near-Real-Time Query
• Fast Updates

Cautions (?)
• Deep SQL Analytics
• Full-Table Scans / Scaled Analytics
• Existing BI Integrations


SQL on Hadoop - Good to Know

• No one-size-fits-all solution
• Use cases and query patterns are important
• Prototype and fail fast
• Define scalability and performance criteria





SQL on Hadoop Use Cases


Hive: Analytics Use Case

Financial Services Company:
– Analyze large datasets to identify potential fraud.
– Re-platformed from a mature EDW platform.
– Selection drivers: breadth of SQL support, query performance, cloud consumption.

Use Case Vitals:
– Analyze > 25 billion transactions per week.
– More than 1.5 TB new data per day.
– > 4 PB historical data available for analysis through cloud infrastructure.


Hive Performance with Scaling - Customer results on HDP 2.2

[Chart: Scalability on Hive. Elapsed time (seconds) for the Multi join - Allocation, Aggregation and Total workloads on 5, 10, 20, 40 and 60 nodes.]

Elapsed time (mm:ss):

Benchmark test | 5 nodes | 10 nodes | 20 nodes | 40 nodes* | 60 nodes*
Multi join | 24:02 | 14:33 | 10:32 | 06:54 | 05:49
Aggregation | 21:59 | 12:20 | 07:55 | 05:16 | 02:38
Total | 46:02 | 26:53 | 18:27 | 12:10 | 08:27

Same workload on EDW (full rack): 8:00
(*) Projected times based on 5, 10 and 20 node results.

Aggregation Workload

• 5% more time required on Hive.

• < 50% solution cost versus traditional EDW.
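The scaling figures in the benchmark table can be checked with a little arithmetic. The sketch below transcribes the Total row by hand and computes speedup and scaling efficiency relative to the 5-node run, plus the gap to the EDW full-rack time. The function and variable names are illustrative only and do not come from the deck; keep in mind that the 40- and 60-node times are projections, not measurements.

```python
# Total elapsed times (mm:ss) from the benchmark table; 40 and 60 nodes are projected.
TOTAL_TIMES = {5: "46:02", 10: "26:53", 20: "18:27", 40: "12:10", 60: "08:27"}
EDW_FULL_RACK = "8:00"


def to_seconds(mmss: str) -> int:
    """Convert an mm:ss string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def scaling_report(times: dict, baseline_nodes: int = 5) -> dict:
    """Return {nodes: (speedup, efficiency)} relative to the baseline cluster size."""
    base = to_seconds(times[baseline_nodes])
    report = {}
    for nodes, mmss in sorted(times.items()):
        speedup = base / to_seconds(mmss)
        # An efficiency of 1.0 would mean perfectly linear scaling.
        efficiency = speedup / (nodes / baseline_nodes)
        report[nodes] = (speedup, efficiency)
    return report


def overhead_vs_edw(hive_mmss: str, edw_mmss: str) -> float:
    """Fraction of extra time Hive needs compared with the EDW run."""
    return to_seconds(hive_mmss) / to_seconds(edw_mmss) - 1.0


if __name__ == "__main__":
    for nodes, (speedup, eff) in scaling_report(TOTAL_TIMES).items():
        print(f"{nodes:>2} nodes: speedup {speedup:.2f}x, efficiency {eff:.0%}")
    extra = overhead_vs_edw(TOTAL_TIMES[60], EDW_FULL_RACK)
    print(f"60-node Hive vs EDW full rack: {extra:.1%} more time")
```

On these numbers the projected 60-node total (8:27) takes between 5 and 6 percent longer than the EDW full rack (8:00), which is consistent with the "5% more time" bullet, while scaling from 5 to 60 nodes is clearly sub-linear.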


SparkSQL Use Case: Medical

[Diagram components: Sensor Data, HDFS, Aggregations (Hive), HCatalog, SparkSQL, JDBC Connector, Analytical Tools]

- Sensor data streamed into HDFS
- Large-scale pre-aggregations done using Hive
- SparkSQL powered dashboard for fast analytics.
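The pattern in this use case is to run one heavy batch aggregation over the raw data and then serve interactive dashboard queries from the much smaller summary. A minimal sketch of that split follows. It uses Python's built-in sqlite3 module purely as a stand-in for both Hive and SparkSQL so that it runs anywhere; the table names, columns and sample readings are hypothetical and are not taken from the deck.

```python
import sqlite3

# Hypothetical raw sensor readings: (device_id, hour, value).
RAW_READINGS = [
    ("dev-1", 0, 70.0), ("dev-1", 0, 74.0), ("dev-1", 1, 90.0),
    ("dev-2", 0, 60.0), ("dev-2", 1, 62.0), ("dev-2", 1, 66.0),
]


def build_store(readings):
    """Load raw readings, then run the batch pre-aggregation step once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sensor_raw (device_id TEXT, hour INTEGER, value REAL)")
    conn.executemany("INSERT INTO sensor_raw VALUES (?, ?, ?)", readings)
    # Batch step (Hive's role on the slide): collapse raw rows to hourly summaries.
    conn.execute(
        """
        CREATE TABLE sensor_hourly AS
        SELECT device_id, hour, COUNT(*) AS n, AVG(value) AS avg_value,
               MAX(value) AS max_value
        FROM sensor_raw
        GROUP BY device_id, hour
        """
    )
    return conn


def dashboard_query(conn, device_id):
    """Interactive step (SparkSQL's role): read only the small summary table."""
    cur = conn.execute(
        "SELECT hour, n, avg_value, max_value FROM sensor_hourly "
        "WHERE device_id = ? ORDER BY hour",
        (device_id,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_store(RAW_READINGS)
    for row in dashboard_query(conn, "dev-1"):
        print(row)
```

The dashboard query never touches `sensor_raw`, which is the point of the design: the expensive scan happens once in batch, and interactive latency depends only on the size of the summary table.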


Phoenix at PubMatic

Near-Real-Time SQL over >15 TB of Data Using Apache Phoenix


Apache Phoenix at PubMatic

Key Concerns

PubMatic offers marketing automation with real-time analytics that enable publishers to make smarter and faster decisions.

To empower publishers to make real-time decisions, PubMatic needs a SQL solution that scales to terabytes of data yet can process hundreds of thousands of queries daily with near-real-time SLAs.

Solution

Phoenix is the only Open Source SQL Solution for Hadoop designed for near-real-time querying, giving PubMatic’s publishers the timely insight they need to optimize their advertising strategies.

Phoenix’s linear scalability enables PubMatic to offer real-time query over more than 15 terabytes of data using commodity hardware.

Phoenix’s ANSI SQL interface makes it easy for publishers to slice and dice data the way they want.

Read more at http://phoenix.apache.org/who_is_using.html

SQL on Hadoop Next Evolution


Evolution of Hive

Phase 1 (Delivered: HDP 2.2): Batch/ETL

• Transactions with ACID allowing insert, update and delete
• Temporary Tables
• Cost Based Optimizer optimizes complex join queries well.

Phase 2 (Delivered: HDP 2.5): Faster SQL

• Tech Preview: Sub-5-Second queries with LLAP
• Usability: SQL Query Editor, Visual Explain and Debugging
• Transparent Data Encryption
• Cross-Site Replication
• SQL, Performance Improvements
• Hive-on-Spark (Alpha / Beta)

Phase 3 (Planned: HDP 2.6*): Sub-Second with Rich Analytics

• Rich SQL:2011 Analytics
• Tech Preview: Druid OLAP Index for Hive
• GA: Sub-Second queries with LLAP
• Transaction Improvements (BEGIN/COMMIT/ROLLBACK, MERGE)


Apache Hive: Modern Architecture

[Diagram layers]
• Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)
• Engine: SQL Engines (Row Engine, Vector Engine)
• SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
• Cache: Block Cache, Linux Cache, Vector Cache
• Distributed Execution: Hadoop 1 (MapReduce); Hadoop 2 (Tez, Spark); LLAP (Persistent Server)

Legend: Historical, Current, In Development


Sub-Second Hive with LLAP

Sub-Second:
• LLAP: Persistent server to instantly execute SQL queries.
• Caches hottest data in RAM.
• Overcomes latencies associated with Hive on Tez or Hive on Spark.

SQL Compatibility:
• 100% Compatible with Hive SQL.
• Compatible with existing tools (BI, ETL, etc.)

Security:
• Security via HiveServer2.
• Integrates with Apache Ranger.

[Diagram: Hive SQL is submitted to HiveServer2, which dispatches to LLAP Servers (1 per Hadoop node), each with its own Vector Cache.]


Hive 2 with LLAP: Architecture Overview

[Diagram components]
• ODBC / JDBC SQL Queries
• HiveServer2 (Query Endpoint)
• Query Coordinators
• YARN Cluster: LLAP Daemons, each with Query Executors
• In-Memory Cache (Shared Across All Users)
• Deep Storage: HDFS, S3 + Other HDFS Compatible Filesystems


Apache Hive: Journey to SQL:2011 Analytics

Data Types
– Numeric: FLOAT, DOUBLE; DECIMAL; INT, TINYINT, SMALLINT, BIGINT; BOOLEAN
– String: CHAR, VARCHAR; BLOB (BINARY), CLOB (String)
– Date, Time: DATE, TIMESTAMP, Interval Types
– Complex Types: ARRAY / MAP / STRUCT / UNION
– Nested Data Analytics: Nested Data Traversal; Lateral Views
– Procedural Extensions: HPL/SQL

SQL Features
– Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT
– Advanced Analytics: OLAP and Windowing Functions; OLAP: Partition, Order by UDAF; CUBE and Grouping Sets
– ACID Transactions: INSERT / UPDATE / DELETE
– Constraints: Primary / Foreign Key (Non Validated)

File Formats
– Columnar: ORCFile; Parquet
– Text: CSV; Logfile
– Nested / Complex: Avro; JSON; XML
– Custom Formats
– Other Features: XPath Analytics

Futures
– ACID MERGE
– Multi Subquery
– Scalar Subqueries
– Non-Equijoins
– INTERSECT / EXCEPT
– Recursive CTEs
– NOT NULL Constraints
– Default Values
– Multi Table Transactions

Legend: New; HDP 2.6; Projected: HDP 3.0

Track Hive SQL:2011 Complete: HIVE-13554


Phoenix SQL: Today and Tomorrow

Phoenix: SQL for HBase

First column:
• SQL Datatypes (VARCHAR, INTEGER, etc.)
• JOINs: Inner, Left/Right Outer, Cross
• UPSERT / DELETE
• Derived Tables
• GROUP BY, ORDER BY, HAVING
• AVG, COUNT, MIN, MAX, SUM
• Primary keys, NOT NULL constraints
• CASE, COALESCE
• VIEWs
• Secondary Indexes
• Flexible Schema

Second column:
• UNION ALL
• Functional Indexes
• Date / Time Functions
• UDFs
• Multi Table Transactions
• SQL GRANT / REVOKE
• Replication Management
• Column Constraints and Defaults
• OLAP, Cubing, Rollup
• UNION

Legend: Current, Phoenix 4.4, Future


Looking forward - What Is Druid?

Druid is a distributed, real-time, column-oriented datastore designed to quickly ingest and index large amounts of data and make it available for real-time query.

Features:
• Streaming Data Ingestion
• Sub-Second Queries
• Merge Historical and Real-Time Data
• Approximate Computation


Druid’s Role in Scalable Data Warehousing

[Diagram components]
• UI: Superset UI (Fast Exploration), Builder UI, SQL BI Tools, MDX Tools
• Unified SQL and MDX Layer: HiveServer2 (Hive SQL), Thrift Server (SparkSQL), MDX (Fast SQL MDX)
• Core Platform: Hive, Druid OLAP Indexes, Realtime Feeds (Kafka, Storm, etc.), S3 or HDFS
• Operations: SmartSense, Ranger, Atlas, Ambari Management

SQL on Hadoop Conclusion


Hive 2 with LLAP: 26x Performance Boost

[Chart: Hive 2 with LLAP averages 26x faster than Hive 1. Series: Hive 1 / Tez Time (s), Hive 2 / LLAP Time (s), Speedup (x Factor). Axes: Query Time (s) (Lower is Better) and Speedup (x Factor). Queries shown: query43.sql, query73.sql, query63.sql, query3.sql, query7.sql, query89.sql, query34.sql, query42.sql, query27.sql, query52.sql, query55.sql, query13.sql, query79.sql, query98.sql, query19.sql.]


SQL on Hadoop: Investment Areas

Interactive Performance
• Caching in Flash / SSD
• Fast Analytics on Raw Text
• Materialized Views

SQL Compliance
• Comprehensive SQL:2011 Support
• SQL ACID
• SQL Standard MERGE

EDW Integrations
• Joint AtScale / Syncsort Roadmap
• OLAP Indexes with Druid


SQL on Hadoop Summary

Project | Strengths | Use Cases | Unique Capabilities
Apache Hive | Most Comprehensive SQL; Scale; Maturity | ETL Offload; Reporting; Large-Scale Aggregations | Robust Cost-Based Optimizer; Mature Ecosystem (BI, Backup, Security, Replication)
SparkSQL | In-Memory; Low Latency | Exploratory Analytics; Dashboards | Language-Integrated Query
Apache Phoenix | Real-Time Read/Write; Transactions; High Concurrency | Dashboards; System-of-Engagement; Drill-Down / Drill-Up | Real-Time Read/Write


Scalable Data Warehousing on Hadoop: Overview

[Diagram components]
• Stages: Ingest and Store; ETL, Data Mining, Advanced Analytics; Interactive SQL, Reporting, OLAP
• Components: Kafka, NiFi, HDFS, Syncsort DMX-h ETL, Other ETL Tools, Spark, Hive, HPL/SQL, ACID, Hive LLAP, Druid (Future), HAWQ, AtScale, Zeppelin, Ambari Hive View, BI Tools, Reporting Tools
• Governance and security: Atlas (Governance and Lineage), Ranger (Advanced Security)


Thank You


