hadoop from hive with stinger to tez

www.rubicon.nl Hadoop: From Hive with Stinger to Tez Jan Pieter Posthuma March 5, 2015

Upload: jan-pieter-posthuma

Post on 24-Jul-2015

Page 1: Hadoop from Hive with Stinger to Tez

www.rubicon.nl

Hadoop: From Hive with Stinger to Tez

Jan Pieter Posthuma

March 5, 2015

Page 2: Hadoop from Hive with Stinger to Tez

Introduction

Jan Pieter Posthuma Microsoft Data Consultant

Rubicon, local consultancy firm in the Netherlands

Architect role at multiple projects

Analysis Services, Reporting Services, Big Data, HDInsight, Cloud BI, Power BI

http://twitter.com/jppp

http://linkedin.com/[email protected]

Page 3: Hadoop from Hive with Stinger to Tez

Agenda

Hadoop

Hive

Stinger

Tez

Page 4: Hadoop from Hive with Stinger to Tez

Hadoop

Hadoop is a collection of software for building a data-intensive distributed cluster that runs on commodity hardware:

‘store and process the data on the Internet in a simple, scalable and economically feasible way’

Widely accepted by database vendors as a solution for unstructured data

Microsoft partners with Hortonworks and delivers the Hortonworks Data Platform as Microsoft HDInsight (now on Windows and Linux)

Available on premises and as an Azure service

Hortonworks Data Platform (HDP) is 100% open source!

Page 5: Hadoop from Hive with Stinger to Tez

Why SQL on Hadoop?

Hadoop is great for cost, but MapReduce is too difficult.

SQL on Hadoop makes Hadoop real and gives me scale that traditional SQL can’t offer.

I’m deleting important data because it’s too expensive to store it.

Page 6: Hadoop from Hive with Stinger to Tez

Hive

Developed at Facebook to address traditional RDBMS limitations.

300+ PB of data under management.

600+ TB of data loaded daily.

60,000+ Hive queries per day.

More than 1,000 users per day.

Initial Apache release in April 2009.

Problem: Hive is bound to MapReduce, which leads to high latency. Hive needs higher performance.

Page 7: Hadoop from Hive with Stinger to Tez

Stinger

‘Making Apache Hive 100 Times Faster’

Hortonworks blog, February 2013

SQL Engine: Vectorized SQL Engine

+ Columnar Storage: ORCFile

+ Distributed Execution: Apache Tez

= 100X
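
As a rough sketch of how two of these Stinger components are switched on per session, the HiveQL settings below select Tez as the execution engine and enable vectorized query execution. This assumes a Hive 0.13 or later installation with Tez available on the cluster; whether a given query actually runs vectorized depends on the table format and the operators involved.

```sql
-- Run queries on Tez instead of classic MapReduce
-- (assumes Tez is installed and configured on the cluster).
SET hive.execution.engine=tez;

-- Process rows in batches rather than one row at a time.
SET hive.vectorized.execution.enabled=true;
```

Both settings can also be placed in hive-site.xml to make them the default for all sessions.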

Page 8: Hadoop from Hive with Stinger to Tez

ORCFiles

Started by Hortonworks, with input from Microsoft, to improve on the existing RCFile format and to work well with the query engine (QE) and Tez

Two goals:

Improve query speed

Improve storage efficiency

CREATE TABLE … STORED AS ORC
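
To make the statement above concrete, here is a minimal sketch of creating an ORC table and filling it from an existing text-format table. The table names (`web_sales_orc`, `web_sales_text`) and the columns are hypothetical and only serve as illustration; the compression property is optional.

```sql
-- Hypothetical table and columns, for illustration only.
CREATE TABLE web_sales_orc (
  order_id BIGINT,
  customer STRING,
  amount   DECIMAL(10,2),
  sold_at  TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- Copy data from an existing text-format table
-- (assumed to exist with matching columns).
INSERT OVERWRITE TABLE web_sales_orc
SELECT order_id, customer, amount, sold_at
FROM web_sales_text;
```

Rewriting the data through an INSERT ... SELECT is what converts it to the columnar layout; changing only the table definition does not convert existing files.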

Page 9: Hadoop from Hive with Stinger to Tez

YARN

Page 10: Hadoop from Hive with Stinger to Tez

Tez

Page 11: Hadoop from Hive with Stinger to Tez

Stinger TPC-DS Benchmark at 30 Terabyte Scale

Sample of 50 queries from TPC-DS at 30 terabyte scale.

Average 52x query speedup, maximum 160x query speedup.

Total benchmark time decreased from 7.8 days to 9.3 hours. (3)
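
As a quick check of the quoted totals, the ratio of the two benchmark times can be worked out directly. It comes out lower than the 52x per-query average, which would be consistent with the longest-running queries dominating the total (an interpretation, not something the slide states):

```latex
\[
\frac{7.8\ \text{days} \times 24\ \text{h/day}}{9.3\ \text{h}}
  = \frac{187.2\ \text{h}}{9.3\ \text{h}}
  \approx 20.1
\]
```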

The Cost-Based Optimizer added in Hive 0.14 gave an additional 2.5x speedup.
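
The cost-based optimizer relies on table and column statistics, so a hedged sketch of enabling it looks like the following. The table `web_sales_orc` and its columns are hypothetical, and Hive 0.14 or later is assumed.

```sql
-- Enable the cost-based optimizer and let it read column statistics.
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;

-- Gather table-level and column-level statistics
-- (hypothetical table and column names).
ANALYZE TABLE web_sales_orc COMPUTE STATISTICS;
ANALYZE TABLE web_sales_orc COMPUTE STATISTICS FOR COLUMNS order_id, customer, amount;
```

Without up-to-date statistics the optimizer has little to base its cost estimates on, so the speedup quoted above should not be expected automatically.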

Page 12: Hadoop from Hive with Stinger to Tez

Stinger.Next

Stinger.Next (in 3 phases):

Transactions with ACID semantics: allow users to easily modify data with inserts, updates and deletes. This extends Hive from the traditional write-once, read-often system to support analytics over changing data.

Sub-second queries: allow users to deploy Hive for interactive dashboards and exploratory analytics that have more demanding response-time requirements. Emergence of LLAP (Live Long and Process) and Hive on Spark.

SQL:2011 Analytics: allows rich reporting to be deployed on Hive faster, more simply and reliably using standard SQL. A powerful cost-based optimizer ensures complex queries and tool-generated queries run fast. Hive now provides the full expressive power that enterprise SQL users have enjoyed, but at Hadoop scale.
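
The ACID phase can be illustrated with a small sketch, assuming Hive 0.14 or later. The table `customers_acid` and its rows are hypothetical. In these Hive versions a transactional table has to be bucketed and stored as ORC, and the metastore also needs compaction configured, which is not shown here.

```sql
-- Session settings for transactions (Hive 0.14 / 1.x era).
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical transactional table: bucketed, ORC, transactional=true.
CREATE TABLE customers_acid (
  id   BIGINT,
  name STRING,
  city STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");

-- Inserts, updates and deletes on the same table.
INSERT INTO TABLE customers_acid VALUES (1, 'Ann', 'Utrecht'), (2, 'Bob', 'Delft');
UPDATE customers_acid SET city = 'Leiden' WHERE id = 2;
DELETE FROM customers_acid WHERE id = 1;
```

Note that the bucketing column (`id` here) cannot be changed by an UPDATE, which is why the example updates `city` instead.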

Page 13: Hadoop from Hive with Stinger to Tez

Apache Hive: Modern Architecture

Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)

Engine: SQL Engines (Row Engine, Vector Engine); SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)

Cache: Block Cache (Linux Cache); Vector Cache (LLAP)

Distributed Execution: Hadoop 1 (MapReduce); Hadoop 2 (Tez, Spark); Persistent Server

Legend: Historical, Current, In Development

Page 14: Hadoop from Hive with Stinger to Tez

Questions?

Page 15: Hadoop from Hive with Stinger to Tez

Links

Microsoft Big Data: http://www.microsoft.com/bigdata

Hortonworks: http://www.hortonworks.com

Try it yourself via Windows Azure HDInsight: http://azure.com/hdinsight

Page 16: Hadoop from Hive with Stinger to Tez

Useful resources

http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final/

http://hortonworks.com/blog/stinger-next-enterprise-sql-hadoop-scale-apache-hive/

http://hortonworks.com/labs/stinger/

http://hortonworks.com/blog/100x-faster-hive/

http://www.slideshare.net/hugfrance/recent-enhancements-to-apache-hive-query-performance?qid=2cd74ce1-e863-436c-a1ab-52a513c61a27&v=default&b=&from_search=10

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html

http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit

http://hortonworks.com/blog/microsofts-contributions-to-the-stinger-initiative-and-apache-hive/