jethro data meetup index base sql on hadoop - oct-2014
DESCRIPTION
JethroData Index based SQL on Hadoop engine. Architecture comparison of MPP / Full-Scan sql engines such as Impala and Hive to index-based access such as Jethro. SQL and NoSQL NYC meetup Oct 20 2014 Boaz RaufmanTRANSCRIPT
JethroData Indexed Based SQL-on-Hadoop - An Architectural Comparison of Tools
Simpler. Faster. Cheaper.
About the presenter
• Over 25 years experience in software design & mgmt• Expertise in database architecture, information
retrieval and search technologies• Led numerous information retrieval projects for
various Israeli intelligence agencies as well as for commercial companies
• Started JethroData in 2010 with the idea of integrating database and search technologies to accelerate big data analytics
• Bachelor's degree in Computer Science and Philosophy from the Tel-Aviv University
Boaz Raufman – Co-Founder / CTO
SQL-on-Hadoop
Hadoop uses the same parallel design pattern as the parallel databases from last decade
Frameworks
• MapReduce• Tez• Spark
Reborn on Hadoop
• Pivotal HAWQ
• IBM BigSQL• Teradata
Aster• Actian
New Comers
• Hive• Impala• Presto• Tajo• Drill• Spark SQL
Data Node
Full-Scan Execution
Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day
Data Node
Data Node
Data Node
Data Node
Query Executor
Query Executor
Query Executor
Query Executor
Query Executor
Query Planner
/Mgr
Query Planner/
Mgr
Query Planner/
Mgr
Query Planner/
Mgr
Query Planner/
Mgr
Performance and resources based on the size of the dataset
Shared-Nothing MPP Design Principles
Parallel Processing• Divide the work across
many nodes• Try to minimize inter-
node communication• Work should be evenly
distributed
Full Data Scanning• Full sequential scan -
massive I/O• Data locality and local
processing• Minimize amount of data
being read– Columnar data store– Partition by specific key– Block stats
Performance and resource requirement based on the dataset size
MPP Complex queries processing
LocalAggregation
GlobalAggregation
(join, distinct,group by, order by,
sub-query)
Result Merge
Example:SELECTDAY,COUNT( DISTINCT ITEM)FROM T1WHEREPRODUCT=‘abc’GROUP BY DAY
Index based SQL-on-Hadoop
Data Node
The Index-Access Design
Data Node
Data Node
Data Node
Data Node
Jethro Query Node
Query Node
Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day
1. Index Access 2. Read data only for require rows
Performance and resources based on the size of the result-set
Index-Based Design
• Surgical scan – minimum I/O• Performance and required resources based
on the result set size• Extremely efficient for Interactive SQL use
cases• Pay at load time
Architecture – Contrarian Concepts
• Index everything– Every column is indexed
• Colum oriented– Columnar or row-groups– Append only data model
• Everything is stored in HDFS– Can also work with S3 or Posix
• Shared Everything– Separate compute and storage, each
scales-out independently– Minimize cross-node operations– Stateless Query nodes
• Parallelized multi-threaded execution– multiple parallelization dimensions:
columns, row ranges, partitions, pipelining and bucketing
JethroNode
Jethro Node
Storage Layer
Processing Layer
HDFS/Posix/S3
Client
Jethro Indexes
Indexes map each column value to a list of rows
Jethro stores indexes as hierarchical compressed bitmaps Very fast query operations – AND / OR / NOT Processed the entire WHERE clause to a final list of rows Patent pending: http://
www.google.com/patents/WO2013001535A3?cl=en INSERT Performance
– Load is very fast: files are appended, no random read/write, no locks– Jethro Indexes are append-only. If needed, duplicate entries are allowed– Periodic background merge (non-blocking)– Compatible with HDFS
Value Rows
FR rows 5,9,10,11,14
IL rows 1,3,7,12,13
US rows 2,4,6,8,15
Built in optimizations
• Code is written in C++• Column store and true column processing• Vectorization for expression evaluation• Multi-threaded and parallelized execution• Planner using indexes meta data - index-based
queries• Server-side cache in memory and local disk
Full-Scan: Performance depends on size of datasetIndex-Access: Performance depends on size of result-set
Use-Case Analysis
Comparing Recent Benchmarks – Jun 2014
Impala Parquet Vs. Hive/Tez, Presto, Shark
Source
Jethro Vs. Impala/Parquet
Source
Impala
Using the same queries in Jun-2014 Impala benchmarks, we compared Impala with Jethro (TPC-DS, SF 1,000)
Benchmark – Jethro vs. Impala – Oct 2014
q19 q42 q52 q55 q63 q68 q73 q980
20
40
60
80
100
120
140
160
180
200
103
39.8 39.4 39.9
73.4
188.4
84.3 85.2
6.4 5 4.9 412.3 11.7 10.3 4.4
TPC-DS Interactive queriesOct 2014
Impala 1.4.2 Jethro 0.9
* Queries use original TPC-DS filter criteria
*
DEMO
Go to Tableau demo
1. Point browser at: http://54.245.114.83/ 2. Login as try-jethro/jethro1233. Edit workbooks:
1. Jethro: Jethro sd – save2. Impala: Impala sd save
Jethro Query Node
Side-By-Side Implementation
MapReduce / Impala
Jethro Indexer
DataStream
Jethro Query Nodes
▪▪
▪▪▪
▪▪ ▪ ▪▪▪
▪▪ ▪▪▪▪▪
▪
▪▪
Hive / Pig
▪▪▪▪▪
▪▪
▪▪▪
BI Tools
SQL
▪
Existing Hadoop tables are untouched
Indexes are added to select
tables. ~30% incremental
storage
1. Installing JethroData
• Existing Hadoop cluster– CDH 4.x, CDH 5.x, HDP 2.x, EMR 3.x
• Designated Jethro server– Can be inside or outside the cluster– HW: CPU: 16+ cores, Mem: 64GB+, Net: 1GB / 10GB, SSD for
cache• Install Jethro – download package
– rpm install– Install HDFS client (if needed) – Create /jethro dir in HDFS
• Start Jethro– service jethro start
2. Load Data into Jethro
• Run create instance script– JethroAdmin create-instance demo /Jethro/demo
• Define a new table– JethroClient demo localhost 9111
• Create table sales_demo (…);
• Run JethroLoader process– JethroLoader demo sales_demo.desc
sales_demo.csv &• Start Querying – ODBC, JDBC, JethroClient
That’s it!
Road Map
• Jethro S3• Analytic functions• UDF• Cascading optimizer• Function indexes• Light weight text search• Rows group format (Parquet/ORC)• Integration with YARN for resource management• Sync with Hive Metastore/HCatalog• Nested data• Materialized views• Distributes query
Functional Indexes
Solution• Function index created for commonly
used functions• Some function indexes are automatically
created for specific data types. Example:year function for timestamp
• Query optimizer will identify scenarios where functional indexes should be used
• Function index can also be user defined or created on the fly via adaptive optimization
Base Index Function YAER index
Value Rows Value Rows
02/04/2007 5,9,10,11,14 20071,2,3,4,5,6,7,
8,9,10,11,12,
13,14,15
03/05/2007 1,3,7,12,13
10/10/2007 2,4,6,8,15
01/02/2008 15,18 2008 15,16,17,18
05/03/2008 16,17
Query uses function index examples:year(c1)=2007 explicit use for year indexC1 between 2007-01-01 and 2007-12-31 implicit use for year indexC1 between 2007-01-01 and 2008-02-15 Mix: take year index for 2007 and rest from base index
ProblemHow to accelerate this query:Select count(*) from T where year(birthdate)=2007;
Jethro’s Benefits
Simple to use
• Implemented side by side with existing Hadoop system
• Access via SQL or your favorite BI tool
• Integrates with Hadoop eco system
10X Faster queries
• Interactive analysis with Sub second latency
• Access to data as it arrives
• Analyze granular, raw data
50% Cheaper to operate
• Significantly less computing resources
• No dual systems, costly ETL
• Elastically scalable on commodity hardware
• Point browser at: http://www.jethrodata.com/home
• Click • Register:
Try it Today
Jethro – Big Data Analytics. Real-Time.
Thank You!