jethro data meetup index base sql on hadoop - oct-2014

JethroData Indexed Based SQL-on-Hadoop - An Architectural Comparison of Tools

Simpler. Faster. Cheaper.

About the presenter

• Over 25 years experience in software design & mgmt• Expertise in database architecture, information

retrieval and search technologies• Led numerous information retrieval projects for

various Israeli intelligence agencies as well as for commercial companies

• Started JethroData in 2010 with the idea of integrating database and search technologies to accelerate big data analytics

• Bachelor's degree in Computer Science and Philosophy from the Tel-Aviv University

Boaz Raufman – Co-Founder / CTO

SQL-on-Hadoop

Hadoop uses the same parallel design pattern as the parallel databases from last decade

Frameworks

• MapReduce• Tez• Spark

Reborn on Hadoop

• Pivotal HAWQ

• IBM BigSQL• Teradata

Aster• Actian

New Comers

• Hive• Impala• Presto• Tajo• Drill• Spark SQL

Data Node

Full-Scan Execution

Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day

Data Node

Data Node

Data Node

Data Node

Query Executor

Query Executor

Query Executor

Query Executor

Query Executor

Query Planner

/Mgr

Query Planner/

Mgr

Query Planner/

Mgr

Query Planner/

Mgr

Query Planner/

Mgr

Performance and resources based on the size of the dataset

Shared-Nothing MPP Design Principles

Parallel Processing• Divide the work across

many nodes• Try to minimize inter-

node communication• Work should be evenly

distributed

Full Data Scanning• Full sequential scan -

massive I/O• Data locality and local

processing• Minimize amount of data

being read– Columnar data store– Partition by specific key– Block stats

Performance and resource requirement based on the dataset size

MPP Complex queries processing

LocalAggregation

GlobalAggregation

(join, distinct,group by, order by,

sub-query)

Result Merge

Example:SELECTDAY,COUNT( DISTINCT ITEM)FROM T1WHEREPRODUCT=‘abc’GROUP BY DAY

Index based SQL-on-Hadoop

Data Node

The Index-Access Design

Data Node

Data Node

Data Node

Data Node

Jethro Query Node

Query Node

Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day

1. Index Access 2. Read data only for require rows

Performance and resources based on the size of the result-set

Index-Based Design

• Surgical scan – minimum I/O• Performance and required resources based

on the result set size• Extremely efficient for Interactive SQL use

cases• Pay at load time

Architecture – Contrarian Concepts

• Index everything– Every column is indexed

• Colum oriented– Columnar or row-groups– Append only data model

• Everything is stored in HDFS– Can also work with S3 or Posix

• Shared Everything– Separate compute and storage, each

scales-out independently– Minimize cross-node operations– Stateless Query nodes

• Parallelized multi-threaded execution– multiple parallelization dimensions:

columns, row ranges, partitions, pipelining and bucketing

JethroNode

Jethro Node

Storage Layer

Processing Layer

HDFS/Posix/S3

Client

Jethro Indexes

Indexes map each column value to a list of rows

Jethro stores indexes as hierarchical compressed bitmaps Very fast query operations – AND / OR / NOT Processed the entire WHERE clause to a final list of rows Patent pending: http://

www.google.com/patents/WO2013001535A3?cl=en INSERT Performance

– Load is very fast: files are appended, no random read/write, no locks– Jethro Indexes are append-only. If needed, duplicate entries are allowed– Periodic background merge (non-blocking)– Compatible with HDFS

Value Rows

FR rows 5,9,10,11,14

IL rows 1,3,7,12,13

US rows 2,4,6,8,15

http://www.google.com/patents/WO2013001535A3?cl=en

http://www.google.com/patents/WO2013001535A3?cl=en

Built in optimizations

• Code is written in C++• Column store and true column processing• Vectorization for expression evaluation• Multi-threaded and parallelized execution• Planner using indexes meta data - index-based

queries• Server-side cache in memory and local disk

Full-Scan: Performance depends on size of datasetIndex-Access: Performance depends on size of result-set

Use-Case Analysis

Comparing Recent Benchmarks – Jun 2014

Impala Parquet Vs. Hive/Tez, Presto, Shark

Source

Jethro Vs. Impala/Parquet

Source

Impala

Using the same queries in Jun-2014 Impala benchmarks, we compared Impala with Jethro (TPC-DS, SF 1,000)

http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/

http://jethrodata.com/comparing-jethrodata-to-impalas-interactive-benchmark/

Benchmark – Jethro vs. Impala – Oct 2014

q19 q42 q52 q55 q63 q68 q73 q980

20

40

60

80

100

120

140

160

180

200

103

39.8 39.4 39.9

73.4

188.4

84.3 85.2

6.4 5 4.9 412.3 11.7 10.3 4.4

TPC-DS Interactive queriesOct 2014

Impala 1.4.2 Jethro 0.9

* Queries use original TPC-DS filter criteria

*

DEMO

Go to Tableau demo

1. Point browser at: http://54.245.114.83/ 2. Login as try-jethro/jethro1233. Edit workbooks:

1. Jethro: Jethro sd – save2. Impala: Impala sd save

http://54.245.114.83/authoring/Jethrosd-save/Sheet1#1

http://54.245.114.83/authoring/Jethrosd-save/Sheet1#1

http://54.245.114.83/

http://54.245.114.83/

Jethro Query Node

Side-By-Side Implementation

MapReduce / Impala

Jethro Indexer

DataStream

Jethro Query Nodes

▪▪

▪▪▪

▪▪ ▪ ▪▪▪

▪▪ ▪▪▪▪▪

▪

▪▪

Hive / Pig

▪▪▪▪▪

▪▪

▪▪▪

BI Tools

SQL

▪

Existing Hadoop tables are untouched

Indexes are added to select

tables. ~30% incremental

storage

1. Installing JethroData

• Existing Hadoop cluster– CDH 4.x, CDH 5.x, HDP 2.x, EMR 3.x

• Designated Jethro server– Can be inside or outside the cluster– HW: CPU: 16+ cores, Mem: 64GB+, Net: 1GB / 10GB, SSD for

cache• Install Jethro – download package

– rpm install– Install HDFS client (if needed) – Create /jethro dir in HDFS

• Start Jethro– service jethro start

2. Load Data into Jethro

• Run create instance script– JethroAdmin create-instance demo /Jethro/demo

• Define a new table– JethroClient demo localhost 9111

• Create table sales_demo (…);

• Run JethroLoader process– JethroLoader demo sales_demo.desc

sales_demo.csv &• Start Querying – ODBC, JDBC, JethroClient

That’s it!

Road Map

• Jethro S3• Analytic functions• UDF• Cascading optimizer• Function indexes• Light weight text search• Rows group format (Parquet/ORC)• Integration with YARN for resource management• Sync with Hive Metastore/HCatalog• Nested data• Materialized views• Distributes query

Functional Indexes

Solution• Function index created for commonly

used functions• Some function indexes are automatically

created for specific data types. Example:year function for timestamp

• Query optimizer will identify scenarios where functional indexes should be used

• Function index can also be user defined or created on the fly via adaptive optimization

Base Index Function YAER index

Value Rows Value Rows

02/04/2007 5,9,10,11,14 20071,2,3,4,5,6,7,

8,9,10,11,12,

13,14,15

03/05/2007 1,3,7,12,13

10/10/2007 2,4,6,8,15

01/02/2008 15,18 2008 15,16,17,18

05/03/2008 16,17

Query uses function index examples:year(c1)=2007 explicit use for year indexC1 between 2007-01-01 and 2007-12-31 implicit use for year indexC1 between 2007-01-01 and 2008-02-15 Mix: take year index for 2007 and rest from base index

ProblemHow to accelerate this query:Select count(*) from T where year(birthdate)=2007;

Jethro’s Benefits

Simple to use

• Implemented side by side with existing Hadoop system

• Access via SQL or your favorite BI tool

• Integrates with Hadoop eco system

10X Faster queries

• Interactive analysis with Sub second latency

• Access to data as it arrives

• Analyze granular, raw data

50% Cheaper to operate

• Significantly less computing resources

• No dual systems, costly ETL

• Elastically scalable on commodity hardware

• Point browser at: http://www.jethrodata.com/home

• Click • Register:

Try it Today

http://www.jethrodata.com/home

http://www.jethrodata.com/home

Jethro – Big Data Analytics. Real-Time.

Thank You!

jethro data meetup index base sql on hadoop - oct-2014

Software