Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Hive Evolution
Hadoop India Summit
February 2011
Namit Jain (Facebook)
Agenda
• Hive Overview
• Version 0.6 (released!)
• Version 0.7 (under development)
• Hive is now a TLP!
• Roadmaps
What is Hive?
• A Hadoop-based system for querying and managing structured data
– Uses Map/Reduce for execution
– Uses Hadoop Distributed File System (HDFS) for storage
Hive Origins
• Data explosion at Facebook
• Traditional DBMS technology could not keep up with the growth
• Hadoop to the rescue!
• Incubation with ASF, then became a Hadoop sub-project
• Now a top-level ASF project
SQL vs MapReduce
hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
Hive Evolution
• Originally:
– a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs
• Now more and more:
– a parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
Intended Usage
• Web-scale Big Data
– 100's of terabytes
• Large Hadoop cluster
– 100's of nodes (heterogeneous OK)
• Data has a schema
• Batch jobs
– for both loads and queries
So Don't Use Hive If…
• Your data is measured in GB
• You don't want to impose a schema
• You need responses in seconds
• A "conventional" analytic DBMS can already do the job
– (and you can afford it)
• You don't have a lot of time and smart people
Scaling Up
• Facebook warehouse, Jan 2011:
– 2750 nodes
– 30 petabytes disk space
• Data access per day:
– ~40 terabytes added (compressed)
– 25000 map/reduce jobs
• 300-400 users/month
Facebook Deployment
[Architecture diagram; components: Web Servers, Scribe MidTier, Scribe-Hadoop Clusters, Sharded MySQL, Production Hive-Hadoop Cluster, Hive Replication, Adhoc Hive-Hadoop Cluster, Archival Hive-Hadoop Cluster]
System Architecture
Data Model

Hive Entity        Sample Metastore Entity   Sample HDFS Location
Table              T                         /wh/T
Partition          date=d1                   /wh/T/date=d1
Bucketing column   userid                    /wh/T/date=d1/part-0000
                                             …
                                             /wh/T/date=d1/part-1000
                                             (hashed on userid)
External Table     extT                      /wh2/existing/dir
                                             (arbitrary location)
Column Data Types
• Primitive Types
– integer types, float, string, boolean
• Nest-able Collections
– array<any-type>
– map<primitive-type, any-type>
• User-defined types
– structures with attributes which can be of any-type
Hive Query Language
• DDL
– {create/alter/drop} {table/view/partition}
– create table as select
• DML
– Insert overwrite
• QL
– Sub-queries in from clause
– Equi-joins (including Outer joins)
– Multi-table Insert
– Sampling
– Lateral Views
• Interfaces
– JDBC/ODBC/Thrift
Query Translation Example
• SELECT url, count(*) FROM page_views GROUP BY url
• Map tasks compute partial counts for each URL in a hash table
– "map side" pre-aggregation
– map outputs are partitioned by URL and shipped to corresponding reducers
• Reduce tasks tally up partial counts to produce final results
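The map-side pre-aggregation described above can be sketched as follows. This is an illustrative Python model of the dataflow, not Hive's actual execution code; the function names and the two-reducer setup are invented for the example.

```python
from collections import Counter, defaultdict

def map_task(urls):
    # "map side" pre-aggregation: partial counts in an in-memory hash table
    return Counter(urls)

def shuffle(partials, num_reducers):
    # partition map outputs by URL so one reducer sees all partials for a URL
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for url, count in partial.items():
            buckets[hash(url) % num_reducers][url].append(count)
    return buckets

def reduce_task(bucket):
    # tally up partial counts to produce final results
    return {url: sum(counts) for url, counts in bucket.items()}

splits = [["a", "b", "a"], ["b", "b", "c"]]   # two input splits of page_views.url
partials = [map_task(s) for s in splits]
final = {}
for bucket in shuffle(partials, 2):
    final.update(reduce_task(bucket))
# final == {"a": 2, "b": 3, "c": 1}
```

The pre-aggregation matters because each mapper emits one record per distinct URL rather than one per row, shrinking the shuffle.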
FROM (
  SELECT a.status, b.school, b.gender
  FROM status_updates a JOIN profiles b
  ON (a.userid = b.userid AND a.ds='2009-03-20')
) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school
It Gets Quite Complicated!
Behavior Extensibility
• TRANSFORM scripts (any language)
– Serialization+IPC overhead
• User defined functions (Java)
– In-process, lazy object evaluation
• Pre/Post Hooks (Java)
– Statement validation/execution
– Example uses: auditing, replication, authorization, multiple clusters
Map/Reduce Scripts Examples
• add file page_url_to_id.py;
• add file my_python_session_cutter.py;

FROM (
  SELECT TRANSFORM(user_id, page_url, unix_time)
  USING 'page_url_to_id.py'
  AS (user_id, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY user_id
  SORT BY user_id, unix_time
) mylog2
SELECT TRANSFORM(user_id, page_id, unix_time)
USING 'my_python_session_cutter.py'
AS (user_id, session_info);
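The talk does not show the scripts themselves, so here is a hypothetical minimal shape for something like page_url_to_id.py. Hive streams input rows to a TRANSFORM script as tab-separated lines on stdin and parses tab-separated lines from its stdout as output rows; the url_to_id logic below is a stand-in, not the real script.

```python
#!/usr/bin/env python
import sys

def url_to_id(url):
    # stand-in mapping; a real script might consult a lookup file instead
    return str(abs(hash(url)) % 10**8)

def transform(line):
    # one input row: columns arrive tab-separated, output the same way
    user_id, page_url, unix_time = line.rstrip("\n").split("\t")
    return "\t".join([user_id, url_to_id(page_url), unix_time])

if __name__ == "__main__":
    for line in sys.stdin:   # Hive pipes rows in; EOF ends the script
        print(transform(line))
```

This per-row serialization over pipes is exactly the "Serialization+IPC overhead" the previous slide contrasts with in-process Java UDFs.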
UDF vs UDAF vs UDTF
• User Defined Function
– One-to-one row mapping
– concat('foo', 'bar')
• User Defined Aggregate Function
– Many-to-one row mapping
– sum(num_ads)
• User Defined Table Function
– One-to-many row mapping
– explode([1,2,3])
UDF Example
• add jar build/ql/test/test-udfs.jar;
• CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
• SELECT testlength(src.value) FROM src;
• DROP TEMPORARY FUNCTION testlength;
• UDFTestLength.java:

package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFTestLength extends UDF {
  public Integer evaluate(String s) {
    if (s == null) {
      return null;
    }
    return s.length();
  }
}
Storage Extensibility
• Input/OutputFormat: file formats
– SequenceFile, RCFile, TextFile, …
• SerDe: row formats
– Thrift, JSON, ProtocolBuffer, …
• Storage Handlers (new in 0.6)
– Integrate foreign metadata, e.g. HBase
• Indexing
– Under development in 0.7
Release 0.6
• October 2010
– Views
– Multiple Databases
– Dynamic Partitioning
– Automatic Merge
– New Join Strategies
– Storage Handlers
Dynamic Partitions
Automatically create partitions based on distinct values in columns
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
FROM page_view_stg pvs
Automatic merge
• Jobs can produce many files
• Why is this bad?
– Namenode pressure
– Downstream jobs have to deal with file processing overhead
• So, clean up by merging results into a few large files (configurable)
– Use conditional map-only task to do this
Join Strategies
• Old Join Strategies
– Map-reduce and Map Join
• Bucketed map-join
– Allows "small" table to be much bigger
• Sort Merge Map Join
• Deal with skew in map/reduce join
– Conditional plan step for skewed keys
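A map join, the strategy these variants build on, can be sketched in Python as follows. This is an illustrative model (invented function and column names), not Hive's implementation: the small table is loaded into an in-memory hash table, so the large table can be joined entirely in mappers with no shuffle or reduce phase.

```python
def map_join(big_rows, small_rows, key):
    # build an in-memory hash table from the small table
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # stream the big table (in Hive, inside each mapper) and probe the table
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**row, **match}

statuses = [{"userid": 1, "status": "hi"}, {"userid": 2, "status": "yo"}]
profiles = [{"userid": 1, "gender": "f"}]
joined = list(map_join(statuses, profiles, "userid"))
# joined == [{"userid": 1, "status": "hi", "gender": "f"}]
```

The whole strategy hinges on the hash table fitting in mapper memory, which is why the slides emphasize bucketing (probe only one bucket) and falling back to a reduce-side join when the small table is too big.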
Storage Handler Syntax
• HBase Example

CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list");
Release 0.7
• Deployed in Facebook
– Stats Functions
– Indexes
– Local Mode
– Automatic Map Join
– Multiple DISTINCTs
– Archiving
• In development
– Concurrency Control
– Stats Collection
– J/ODBC Enhancements
– Authorization
– RCFile2
– Partitioned Views
– Security Enhancements
Statistical Functions
• Stats 101
– stddev, var, covar
– percentile_approx
• Data Mining
– ngrams, sentences (text analysis)
– histogram_numeric
• SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
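histogram_numeric builds an approximate histogram in one streaming pass. The Python sketch below shows the core idea under that assumption (function name and bin-merging details are invented for illustration; Hive's actual algorithm differs): keep at most nbins (center, count) pairs, and when a new value overflows the budget, merge the two closest bins.

```python
def streaming_histogram(values, nbins):
    bins = []  # sorted list of [center, count] pairs
    for v in values:
        bins.append([float(v), 1])
        bins.sort(key=lambda b: b[0])
        if len(bins) > nbins:
            # merge the adjacent pair of bins with the smallest gap
            i = min(range(len(bins) - 1),
                    key=lambda j: bins[j + 1][0] - bins[j][0])
            c1, n1 = bins[i]
            c2, n2 = bins[i + 1]
            merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
            bins[i:i + 2] = [merged]
    return bins

hist = streaming_histogram([1, 1, 2, 10, 11], nbins=2)
# two bins: one centered near 1-2 (count 3), one near 10-11 (count 2)
```

Because it never holds more than nbins pairs, the aggregate runs in bounded memory per group, which is what makes it usable at the data scales on the earlier slides.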
Histogram query results
• "It's complicated" peaks at 18-19, but lasts into late 40s!
• "In a relationship" peaks at 20
• "Engaged" peaks at 25
• "Married" peaks in early 30s
• More married than single at 28
• Only teenagers use "widowed"?
Pluggable Indexing
• Reference implementation
– Index is stored in a normal Hive table
– Compact: distinct block addresses
– Partition-level rebuild
• Currently in R&D
– Automatic use for WHERE, GROUP BY
– New index types (e.g. bitmap, HBase)
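Taking the slide's description of the compact index at face value (value → distinct block addresses, rather than one entry per row), a toy sketch of building one looks like this. The function and the fixed-size "blocks" are assumptions for illustration, not Hive's index code.

```python
def build_compact_index(key_column, block_size):
    # map each key value to the distinct block ids that contain it;
    # a repeated value inside one block costs only a single entry
    index = {}
    for pos, key in enumerate(key_column):
        index.setdefault(key, set()).add(pos // block_size)
    return index

idx = build_compact_index(["a", "b", "a", "a", "c", "b"], block_size=2)
# idx["a"] == {0, 1}: "a" occurs three times but in only two blocks
```

A query with WHERE key = 'a' could then read just those blocks instead of scanning the partition, which is the payoff the "automatic use for WHERE" work aims at.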
Local Mode Execution
• Avoids map/reduce cluster job latency
• Good for jobs which process small amounts of data
• Let Hive decide when to use it
– set hive.exec.mode.local.auto=true;
• Or force its usage
– set mapred.job.tracker=local;
Automatic Map Join
• Map-Join if small table fits in memory
– If it can't, fall back to reduce join
• Optimize hash table data structures
• Use distributed cache to push out pre-filtered lookup table
– Avoid swamping HDFS with reads from thousands of mappers
Multiple DISTINCT Aggs
• Example
SELECT
view_date,
COUNT(DISTINCT userid),
COUNT(DISTINCT page_url)
FROM page_views
GROUP BY view_date
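Conceptually, the query above needs one set of seen values per DISTINCT expression per group. The Python sketch below is only an illustration of that semantics (invented function name; it is not how Hive plans the query, and holding full sets in memory does not scale the way Hive's execution does):

```python
from collections import defaultdict

def multi_distinct(rows, group_key, distinct_cols):
    # per group key, one set of seen values per DISTINCT column
    sets = defaultdict(lambda: [set() for _ in distinct_cols])
    for row in rows:
        for seen, col in zip(sets[row[group_key]], distinct_cols):
            seen.add(row[col])
    return {k: [len(seen) for seen in v] for k, v in sets.items()}

page_views = [
    {"view_date": "d1", "userid": 1, "page_url": "/a"},
    {"view_date": "d1", "userid": 1, "page_url": "/b"},
    {"view_date": "d1", "userid": 2, "page_url": "/a"},
]
counts = multi_distinct(page_views, "view_date", ["userid", "page_url"])
# counts == {"d1": [2, 2]}
```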
Archiving
• Use HAR (Hadoop archive format) to combine many files into a few
• Relieves namenode memory

ALTER TABLE page_views
{ARCHIVE|UNARCHIVE}
PARTITION (ds='2010-10-30')
Concurrency Control
• Pluggable distributed lock manager
– Default is Zookeeper-based
• Simple read/write locking
• Table-level and partition-level
• Implicit locking (statement level)
– Deadlock-free via lock ordering
• Explicit LOCK TABLE (global)
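The lock-ordering principle named above can be shown in a few lines. This is a sketch of the general technique, not Hive's ZooKeeper-based manager (lock names and helper functions are invented): if every statement acquires its locks in one global sort order, two statements can never each hold a lock the other is waiting for, so no wait cycle can form.

```python
import threading

# hypothetical lock registry: one lock per table/partition entity
locks = {name: threading.Lock() for name in ["T", "T/ds=2010-10-30"]}

def acquire_all(names):
    acquired = []
    for name in sorted(names):   # global order: sorted entity names
        locks[name].acquire()
        acquired.append(name)
    return acquired

def release_all(acquired):
    for name in reversed(acquired):
        locks[name].release()

# regardless of the order a statement lists its entities,
# the locks are taken in the same global order
held = acquire_all(["T/ds=2010-10-30", "T"])
release_all(held)
```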
Statistics Collection
• Implicit metastore update during load
– Or explicit via ANALYZE TABLE
• Table/partition-level
– Number of rows
– Number of files
– Size in bytes
Hive is now a TLP
• PMC
– Namit Jain (chair)
– John Sichi
– Zheng Shao
– Edward Capriolo
– Raghotham Murthy
• Committers
– Amareshwari Sriramadasu
– Carl Steinbach
– Paul Yang
– He Yongqiang
– Prasad Chakka
– Joydeep Sen Sarma
– Ashish Thusoo
– Ning Zhang
Developer Diversity
• Recent Contributors
– Facebook, Yahoo, Cloudera
– Netflix, Amazon, Media6Degrees, Intuit, Persistent Systems
– Numerous research projects
– Many many more…
• Monthly San Francisco bay area contributor meetups
• India meetups?
Roadmap: Heavy-Duty Tests
• Unit tests are insufficient
• What is needed:
– Real-world schemas/queries
– Non-toy data scales
– Scripted setup; configuration matrix
– Correctness/performance verification
– Automatic reports: throughput, latency, profiles, coverage, perf counters…
Roadmap: Shared Test Site
• Nightly runs, regression alerting
• Performance trending
• Synthetic workload (e.g. TPC-H)
• Real-world workload (anonymized?)
• This is critical for
– Non-subjective commit criteria
– Release quality
Roadmap: New Features
• Hive Server Stability/Deployment
• File Concatenation
– Reduce Number of Files
• Performance
– Bloom Filters
– Push Down Filters
• Cost Based Optimizer
– Column Level Statistics
– Plan should be based on Statistics
Resources
• http://hive.apache.org
• user/[email protected]
• [email protected]
• Questions?