
Upload: nikhildeshpande

Post on 13-Jan-2015


DESCRIPTION

Accelerating Hive queries with indexes.

TRANSCRIPT

Page 1: Indexed Hive

www.persistentsys.com

Indexed Hive

A quick demonstration of Hive performance acceleration using indexes

By:

Prafulla Tekawade

Nikhil Deshpande

Page 2: Indexed Hive

www.persistentsys.com © 2010 Persistent Systems Ltd

Summary

• This presentation describes a performance experiment that uses indexes in Hive to accelerate query execution.

• The slides include information on:

  • Indexes

  • A specific set of Group By queries

  • Rewrite technique

  • Performance experiment and results

Page 3: Indexed Hive


Hive usage

• HDFS spreads and scatters the data to different locations (data nodes).

• Data is dumped & loaded into HDFS ‘as is’.

• Only one view of the data: the original data structure & layout

• Typically data is append-only

• Processing times dominated by full data scan times

Can the data access times be better?

Page 4: Indexed Hive


Hive usage

What can be done to speed-up queries?

Cut down the data I/O. Less data means faster processing.

Different ways to get performance

• Columnar storage

• Data partitioning

• Indexing (different view of same data)

• …

Page 5: Indexed Hive


Hive Indexing

• Provides key-based data view

• Key data is duplicated

• Storage layout favors search & lookup performance

• Provides better data access for certain operations

• A cheaper alternative to full data scans!

How cheap?

An order of magnitude better in certain cases!

Page 6: Indexed Hive


What does the index look like?

An index is a table with 3 columns:

hive> describe default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__;
OK
l_shipdate     string           (key)
_bucketname    string           (reference to values)
_offsets       array<string>    (reference to values)

Data in the index looks like:

hive> select * from default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__ limit 2;
OK
1992-01-08    hdfs://hadoop1:54310/user/…/lineitem.tbl    ["662368"]
1992-01-16    hdfs://hadoop1:54310/user/…/lineitem.tbl    ["143623","390763","637910"]
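The three-column layout above can be modelled in a few lines of Python. This is only a sketch of the data structure, not Hive's index builder: the function name `build_compact_index`, the input shape, and the short file name `lineitem.tbl` (standing in for the elided HDFS path) are assumptions made for illustration. The offsets are the ones in the sample output.

```python
from collections import defaultdict

def build_compact_index(files):
    """files: {file_uri: [(byte_offset, key), ...]} -> list of index rows."""
    grouped = defaultdict(list)
    for uri, records in files.items():
        for offset, key in records:
            grouped[(key, uri)].append(str(offset))
    # One row per (key, file), ordered by key, mirroring the 3 columns above.
    return [
        {"l_shipdate": key, "_bucketname": uri, "_offsets": offsets}
        for (key, uri), offsets in sorted(grouped.items())
    ]

# Hypothetical file name; the real path is elided in the slide.
files = {
    "lineitem.tbl": [
        (143623, "1992-01-16"),
        (390763, "1992-01-16"),
        (637910, "1992-01-16"),
        (662368, "1992-01-08"),
    ]
}
index = build_compact_index(files)
for row in index:
    print(row["l_shipdate"], row["_bucketname"], row["_offsets"])
```

Note that the sketch emits one row per (key, file) pair, so a key that occurs in several files would appear in several index rows.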

Page 7: Indexed Hive


Hive index in HQL

• SELECT (mapping, projection, association, given key, fetch value)

• WHERE (filters on keys)

• GROUP BY (grouping on keys)

• JOIN (join key as index key)

Indexes have high potential for accelerating a wide range of queries.

Page 8: Indexed Hive


Hive Index

• Index as Reference

• Index as Data

This demonstration uses the Index as Data technique to show an order-of-magnitude performance gain!

• Uses a Query Rewrite technique to transform queries on the base table into queries on the index table.

• Limited applicability currently (e.g. the demo is based on GROUP BY), but the technique itself has wide potential.

• Also a very quick way to demonstrate the importance of indexes for performance (no deep optimizer/execution engine modifications).

Page 9: Indexed Hive


Indexes and Query Rewrites

Demo targeting:

• GROUP BY, aggregation

• Index as Data

• Group By Key = Index Key

• Query rewritten to use indexes, but still a valid query (nothing special in it!)

Page 10: Indexed Hive


Query Rewrites: simple gb

Original query:

SELECT DISTINCT l_shipdate
FROM lineitem;

Rewritten to use the index:

SELECT l_shipdate
FROM __lineitem_shipdate_idx__;

Page 11: Indexed Hive


Query Rewrites: simple agg

Original query:

SELECT l_shipdate, COUNT(1)
FROM lineitem
GROUP BY l_shipdate;

Rewritten to use the index:

SELECT l_shipdate, size(`_offsets`)
FROM __lineitem_shipdate_idx__;
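Why the rewrite returns the same answer can be checked on a toy data set. The Python sketch below is illustrative only. The rows are made up, and it assumes the index holds exactly one row per key, which is the case when the base table is a single file. With several files the index would have one row per (key, file) and the sizes would have to be summed per key.

```python
from collections import Counter

# Hypothetical base-table rows: only the l_shipdate column matters here.
lineitem = ["1992-01-16", "1992-01-08", "1992-01-16", "1992-01-16"]

# Index rows as (l_shipdate, _offsets), one row per key (single data file assumed).
index = [
    ("1992-01-08", ["662368"]),
    ("1992-01-16", ["143623", "390763", "637910"]),
]

# SELECT l_shipdate, COUNT(1) FROM lineitem GROUP BY l_shipdate
by_scan = dict(Counter(lineitem))

# SELECT l_shipdate, size(`_offsets`) FROM index
by_index = {key: len(offsets) for key, offsets in index}

assert by_scan == by_index
print(by_index)
```

The index path never touches the base rows: the count for each key is simply the length of its offset list.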

Page 12: Indexed Hive


Query Rewrites: gb + where

Original query:

SELECT l_shipdate, COUNT(1)
FROM lineitem
WHERE YEAR(l_shipdate) >= 1992
  AND YEAR(l_shipdate) <= 1996
GROUP BY l_shipdate;

Rewritten to use the index:

SELECT l_shipdate, size(`_offsets`)
FROM __lineitem_shipdate_idx__
WHERE YEAR(l_shipdate) >= 1992
  AND YEAR(l_shipdate) <= 1996;

Page 13: Indexed Hive


Query Rewrites: gb on func(key)

Original query:

SELECT YEAR(l_shipdate) AS Year,
       COUNT(1) AS Total
FROM lineitem
GROUP BY YEAR(l_shipdate);

Rewritten to use the index:

SELECT Year, SUM(cnt) AS Total
FROM (SELECT YEAR(l_shipdate) AS Year,
             size(`_offsets`) AS cnt
      FROM __lineitem_shipdate_idx__) AS t
GROUP BY Year;
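When the grouping expression is a function of the key, several index rows map to the same group, so the per-key sizes have to be summed. The following Python sketch mirrors the inner and outer query of the rewrite. The index rows and the `year` helper (a stand-in for Hive's YEAR() on a 'YYYY-MM-DD' string) are hypothetical.

```python
from collections import defaultdict

# Hypothetical index rows: (l_shipdate, _offsets).
index = [
    ("1992-01-08", ["10"]),
    ("1992-01-16", ["20", "30", "40"]),
    ("1993-03-02", ["50", "60"]),
]

def year(date_string):
    """Stand-in for Hive's YEAR() on a 'YYYY-MM-DD' string."""
    return int(date_string[:4])

# Inner query: SELECT YEAR(l_shipdate) AS Year, size(`_offsets`) AS cnt
inner = [(year(key), len(offsets)) for key, offsets in index]

# Outer query: SELECT Year, SUM(cnt) AS Total ... GROUP BY Year
totals = defaultdict(int)
for yr, cnt in inner:
    totals[yr] += cnt

print(dict(totals))
```

The aggregation still runs, but over one row per distinct ship date rather than one row per line item.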

Page 14: Indexed Hive


Histogram Query

Original query:

SELECT YEAR(l_shipdate) AS Year,
       MONTH(l_shipdate) AS Month,
       COUNT(1) AS Monthly_shipments
FROM lineitem
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);

Rewritten to use the index:

SELECT YEAR(l_shipdate) AS Year,
       MONTH(l_shipdate) AS Month,
       SUM(sz) AS Monthly_shipments
FROM (SELECT l_shipdate, SIZE(`_offsets`) AS sz
      FROM __lineitem_shipdate_idx__) AS t
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);

Page 15: Indexed Hive


Year on Year Query

Original query:

SELECT y1.Month AS Month, y1.shipments AS Y1_shipments, y2.shipments AS Y2_shipments,
       (y2.shipments - y1.shipments) / y1.shipments AS Delta
FROM (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
             COUNT(1) AS Shipments
      FROM lineitem
      WHERE YEAR(l_shipdate) = 1997
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
JOIN (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
             COUNT(1) AS Shipments
      FROM lineitem
      WHERE YEAR(l_shipdate) = 1998
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;

Page 16: Indexed Hive


Year on Year Query

Rewritten to use the index:

SELECT y1.Month AS Month, y1.shipments AS y1_shipments,
       y2.shipments AS y2_shipments,
       (y2.shipments - y1.shipments) / y1.shipments AS delta
FROM (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
             SUM(sz) AS shipments
      FROM (SELECT l_shipdate, size(`_offsets`) AS sz
            FROM __lineitem_shipdate_idx__) AS t1
      WHERE YEAR(l_shipdate) = 1997
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
JOIN (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
             SUM(sz) AS shipments
      FROM (SELECT l_shipdate, size(`_offsets`) AS sz
            FROM __lineitem_shipdate_idx__) AS t
      WHERE YEAR(l_shipdate) = 1998
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;
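The final step of the year-on-year query, joining the two monthly series on month and computing the relative change, is the same whether the counts come from the base table or from the index. A small Python sketch of that step, with made-up monthly counts standing in for the results of the two subqueries:

```python
# Hypothetical monthly shipment counts, as the two subqueries would return them.
y1 = {1: 100, 2: 80}    # month -> shipments in 1997
y2 = {1: 110, 2: 60}    # month -> shipments in 1998

# JOIN ... ON y1.Month = y2.Month, then (y2 - y1) / y1 per month.
delta = {
    month: (y2[month] - y1[month]) / y1[month]
    for month in y1.keys() & y2.keys()
}
print(delta)
```

Only months present in both years survive the join, which matches the inner-join semantics of the query.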

Page 17: Indexed Hive


Performance tests

Hardware and software configuration:

• 2 server-class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM)

• 2-node Hadoop cluster (0.20.2), un-tuned and un-optimized, data neither partitioned nor clustered, Hive tables stored in row-store format, HDFS replication factor: 2

• Hive development branch (~0.5)

• Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE: 4GB RAM)

• Queries on TPC-H data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB lineitem, ~180 million tuples)

Page 18: Indexed Hive


Perf gain for Histogram Query

(sec)       1M        1G        10G        30G
q1_noidx    24.161    76.79     506.005    1551.555
q1_idx      21.268    27.292    35.502     86.133

Graphs not to scale.

Page 19: Indexed Hive


Perf gain for Year on Year Query

(sec)       1M        1G         10G        30G
q1_noidx    73.66     130.587    764.619    2146.423
q1_idx      69.393    75.493     92.867     190.619

Graphs not to scale.
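The gain is easier to read as a ratio of un-indexed to indexed time. The Python sketch below computes it from the timings reported for the histogram query and the year-on-year query. The variable and function names are illustrative. The numbers are copied from the slides.

```python
sizes = ["1M", "1G", "10G", "30G"]

# Timings in seconds, copied from the two result tables.
histogram = {
    "noidx": [24.161, 76.79, 506.005, 1551.555],
    "idx": [21.268, 27.292, 35.502, 86.133],
}
year_on_year = {
    "noidx": [73.66, 130.587, 764.619, 2146.423],
    "idx": [69.393, 75.493, 92.867, 190.619],
}

def speedups(timings):
    """Ratio of un-indexed to indexed time for each data size."""
    return [noidx / idx for noidx, idx in zip(timings["noidx"], timings["idx"])]

for name, timings in [("histogram", histogram), ("year on year", year_on_year)]:
    for size, ratio in zip(sizes, speedups(timings)):
        print(f"{name:>12} {size:>3}: {ratio:.1f}x")
```

At 30G the ratio is about 18x for the histogram query and about 11x for the year-on-year query, while at 1M both are close to 1x, which suggests fixed job overhead dominates at that size.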

Page 20: Indexed Hive


Why does the index perform better?

Reducing data increases I/O efficiency

If you need only X, separate X from the rest.

Less data to process, better memory footprint, better locality of reference…

Exploiting storage layout optimization

“Right tool for the job”, e.g. two ways to do GROUP BY: sort + agg or hash + agg.

The sort step is already done in the index!

Parallelization

• Process the index data in the same manner as the base table, distributing the processing across nodes

• Scalable!
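The two GROUP BY strategies mentioned on this slide can be contrasted in a short Python sketch. It is a conceptual illustration with made-up keys, not Hive's operator code: the hash variant accepts input in any order but keeps a counter for every distinct key, while the sorted variant can emit each group as soon as the key changes, provided the input arrives in key order.

```python
from itertools import groupby

def hash_aggregate(keys):
    """Hash + agg: works on any order, holds one counter per distinct key."""
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return counts

def sorted_aggregate(sorted_keys):
    """Sort + agg: input must already be sorted; emits each group as it ends."""
    for key, group in groupby(sorted_keys):
        yield key, sum(1 for _ in group)

# Hypothetical keys in arrival order, as in an unindexed base table.
unsorted_keys = ["1992-01-16", "1992-01-08", "1992-01-16"]
# The same keys in key order, as an index would store them.
index_order = sorted(unsorted_keys)

assert dict(sorted_aggregate(index_order)) == hash_aggregate(unsorted_keys)
```

The explicit `sorted` call is the cost that the base-table path pays and the index path avoids.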

Page 21: Indexed Hive


Near future

More rewrites

Partitioning index data per key.

Run-time operators for index usage (lookup, join, filter etc., since rewrites are only a partial solution).

Optimizer support for index operators.

Cost-based optimizer to choose between index and non-index plans.

Page 22: Indexed Hive


Index Design

[Diagram: index design. Components shown: HDFS, Hadoop MR, Hive Query Engine, Hive DDL Engine, Hive Query Compiler, Hive DDL Compiler, Index Builder, Query Rewrite Engine.]

Page 23: Indexed Hive


Hive Compiler

[Diagram: Hive compiler stages. Components shown: Parser / AST Generator, Optimizer / Operator Plan Generator, Execution Plan Generator (output to Hadoop MR), SemanticAnalyzer, QueryRewriteEngine.]

Page 24: Indexed Hive


Query Rewrite Engine

[Diagram: a Query Tree goes into the Rule Engine, which uses a Rewrite Rules Repository and produces a Rewritten Query Tree. The repository holds several Rewrite Rules, each made up of a Rewrite Trigger Condition and a Rewrite Action.]
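The rule structure named on this slide (a trigger condition paired with a rewrite action, collected in a repository and applied by an engine) can be sketched as follows. This is a hypothetical Python illustration, not the code in the Hive patch: the dictionary-based query tree, the rule name `gb_to_idx`, and all function names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RewriteRule:
    name: str
    trigger: Callable[[dict], bool]   # does this rule apply to the query tree?
    action: Callable[[dict], dict]    # returns a new, rewritten query tree

def count_group_by_on_index_key(query):
    # Hypothetical trigger: COUNT(1) grouped by exactly the indexed column.
    return (
        query["select"] == [query["index_key"], "COUNT(1)"]
        and query["group_by"] == [query["index_key"]]
    )

def use_index_table(query):
    # Build a new tree rather than mutating the input one.
    return {
        "select": [query["index_key"], "size(`_offsets`)"],
        "from": query["index_table"],
        "group_by": [],
        "index_key": query["index_key"],
        "index_table": query["index_table"],
    }

# The "repository": an ordered list of rules.
RULES = [RewriteRule("gb_to_idx", count_group_by_on_index_key, use_index_table)]

def rewrite(query, rules=RULES):
    """Apply the first rule whose trigger fires; otherwise return the query unchanged."""
    for rule in rules:
        if rule.trigger(query):
            return rule.action(query)
    return query

query = {
    "select": ["l_shipdate", "COUNT(1)"],
    "from": "lineitem",
    "group_by": ["l_shipdate"],
    "index_key": "l_shipdate",
    "index_table": "__lineitem_shipdate_idx__",
}
rewritten = rewrite(query)
print(rewritten["from"], rewritten["select"])
```

Keeping triggers and actions as separate callables means a new rewrite can be added by appending a rule to the list, without touching the engine loop.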

Page 25: Indexed Hive


Learning Hive

• Hive compiler is not ‘Syntax Directed Translation’ driven

  • Tree visitor based, separation of data structs and compiler logic

  • Tree is immutable (harder to change, harder to rewrite)

  • Query semantic information is maintained separately from the query lexical/parse tree, in different data structures. These are loosely bound in a Query Block data structure, which is itself loosely bound to the parse tree, yet there is no larger data flow graph off which everything is hung. This makes it very difficult to rewrite queries.

• Optimizer is not yet mature

  • Doesn’t handle many ‘obvious’ opportunities (e.g. sort group by for cases other than base table scans)

  • Optimizer is rule-based, not cost-based, no stats collected

• Query tuning is a harder job (requires special knowledge of the optimizer guts, what works and what doesn’t)

• Setting up the development environment is tedious (the build system relies heavily on an internet connection, troublesome behind restrictive firewalls).

• Folks in the community are very active; dependent JIRAs are a fast-moving target and, development-wise, we need to keep up with them actively (e.g. if branching, need to refresh frequently from trunk).

Page 26: Indexed Hive


How to get it?

• Needs a working Hadoop cluster (tested with 0.20.2)

• For the Hive with Indexing support:

• Hive Index DDL patch (JIRA 417) is now part of Hive trunk:

https://issues.apache.org/jira/browse/HIVE-417

• Get the Hive branch with the Index Query Rewrite patch applied from GitHub (a fork/branch of the Hive development tree: a snapshot of the Hive + Index DDL source tree, not the latest, but a single place to get everything):

http://github.com/prafullat/hive

Refer to the Hive documentation for building: http://wiki.apache.org/hadoop/Hive/GettingStarted#Downloading_and_building

See the ql/src/test/queries/client/positive/ql_rewrite_gbtoidx.q test.

Page 27: Indexed Hive


Thank You!

prafulla_tekawade at persistent dot co dot in

nikhil_deshpande at persistent dot co dot in