Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Hive Evolution
Hadoop India Summit
February 2011
Namit Jain (Facebook)
Agenda
• Hive Overview
• Version 0.6 (released!)
• Version 0.7 (under development)
• Hive is now a TLP!
• Roadmaps
What is Hive?
• A Hadoop-based system for querying and managing structured data
– Uses Map/Reduce for execution
– Uses Hadoop Distributed File System (HDFS) for storage
Hive Origins
• Data explosion at Facebook
• Traditional DBMS technology could not keep up with the growth
• Hadoop to the rescue!
• Incubation with ASF, then became a Hadoop sub-project
• Now a top-level ASF project
SQL vs MapReduce
hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
Hive Evolution
• Originally:
– a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs
• Now more and more:
– a parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
Intended Usage
• Web-scale Big Data
– 100's of terabytes
• Large Hadoop cluster
– 100's of nodes (heterogeneous OK)
• Data has a schema
• Batch jobs
– for both loads and queries
So Don't Use Hive If…
• Your data is measured in GB
• You don't want to impose a schema
• You need responses in seconds
• A "conventional" analytic DBMS can already do the job
– (and you can afford it)
• You don't have a lot of time and smart people
Scaling Up
• Facebook warehouse, Jan 2011:
– 2750 nodes
– 30 petabytes disk space
• Data access per day:
– ~40 terabytes added (compressed)
– 25000 map/reduce jobs
• 300-400 users/month
Facebook Deployment
[Architecture diagram; components: Web Servers, Scribe MidTier, Scribe-Hadoop Clusters, Sharded MySQL, Production Hive-Hadoop Cluster, Hive Replication, Adhoc Hive-Hadoop Cluster, Archival Hive-Hadoop Cluster]
System Architecture
Data Model

Hive Entity        Sample Metastore Entity   Sample HDFS Location
Table              T                         /wh/T
Partition          date=d1                   /wh/T/date=d1
Bucketing column   userid                    /wh/T/date=d1/part-0000
                                             …
                                             /wh/T/date=d1/part-1000
                                             (hashed on userid)
External Table     extT                      /wh2/existing/dir
                                             (arbitrary location)
Column Data Types
• Primitive Types
– integer types, float, string, boolean
• Nest-able Collections
– array<any-type>
– map<primitive-type, any-type>
• User-defined types
– structures with attributes which can be of any-type
Hive Query Language
• DDL
– {create/alter/drop} {table/view/partition}
– create table as select
• DML
– Insert overwrite
• QL
– Sub-queries in from clause
– Equi-joins (including Outer joins)
– Multi-table Insert
– Sampling
– Lateral Views
• Interfaces
– JDBC/ODBC/Thrift
Query Translation Example
• SELECT url, count(*) FROM page_views GROUP BY url
• Map tasks compute partial counts for each URL in a hash table
– "map side" pre-aggregation
– map outputs are partitioned by URL and shipped to corresponding reducers
• Reduce tasks tally up partial counts to produce final results
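The map-side pre-aggregation described above can be sketched as follows. This is an illustrative Python model of the dataflow, not Hive's actual execution code; the function names and the two-reducer setup are invented for the example.

```python
from collections import Counter, defaultdict

def map_task(urls):
    # "map side" pre-aggregation: partial counts in an in-memory hash table
    return Counter(urls)

def shuffle(partials, num_reducers):
    # partition map outputs by URL so one reducer sees all partials for a URL
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for url, count in partial.items():
            buckets[hash(url) % num_reducers][url].append(count)
    return buckets

def reduce_task(bucket):
    # tally up partial counts to produce final results
    return {url: sum(counts) for url, counts in bucket.items()}

splits = [["a", "b", "a"], ["b", "b", "c"]]   # two input splits of page_views.url
partials = [map_task(s) for s in splits]
final = {}
for bucket in shuffle(partials, 2):
    final.update(reduce_task(bucket))
# final == {"a": 2, "b": 3, "c": 1}
```

The pre-aggregation matters because each mapper emits one record per distinct URL rather than one per row, shrinking the shuffle.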
FROM (
  SELECT a.status, b.school, b.gender
  FROM status_updates a JOIN profiles b
  ON (a.userid = b.userid AND a.ds='2009-03-20')
) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school
It Gets Quite Complicated!
Behavior Extensibility
• TRANSFORM scripts (any language)
– Serialization+IPC overhead
• User defined functions (Java)
– In-process, lazy object evaluation
• Pre/Post Hooks (Java)
– Statement validation/execution
– Example uses: auditing, replication, authorization, multiple clusters
Map/Reduce Scripts Examples
• add file page_url_to_id.py;
• add file my_python_session_cutter.py;

FROM (
  SELECT TRANSFORM(user_id, page_url, unix_time)
  USING 'page_url_to_id.py'
  AS (user_id, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY user_id
  SORT BY user_id, unix_time
) mylog2
SELECT TRANSFORM(user_id, page_id, unix_time)
USING 'my_python_session_cutter.py'
AS (user_id, session_info);
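The talk does not show the scripts themselves, so here is a hypothetical minimal shape for something like page_url_to_id.py. Hive streams input rows to a TRANSFORM script as tab-separated lines on stdin and parses tab-separated lines from its stdout as output rows; the url_to_id logic below is a stand-in, not the real script.

```python
#!/usr/bin/env python
import sys

def url_to_id(url):
    # stand-in mapping; a real script might consult a lookup file instead
    return str(abs(hash(url)) % 10**8)

def transform(line):
    # one input row: columns arrive tab-separated, output the same way
    user_id, page_url, unix_time = line.rstrip("\n").split("\t")
    return "\t".join([user_id, url_to_id(page_url), unix_time])

if __name__ == "__main__":
    for line in sys.stdin:   # Hive pipes rows in; EOF ends the script
        print(transform(line))
```

This per-row serialization over pipes is exactly the "Serialization+IPC overhead" the previous slide contrasts with in-process Java UDFs.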
UDF vs UDAF vs UDTF
• User Defined Function
– One-to-one row mapping
– concat('foo', 'bar')
• User Defined Aggregate Function
– Many-to-one row mapping
– sum(num_ads)
• User Defined Table Function
– One-to-many row mapping
– explode([1,2,3])
UDF Example
• add jar build/ql/test/test-udfs.jar;
• CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
• SELECT testlength(src.value) FROM src;
• DROP TEMPORARY FUNCTION testlength;
• UDFTestLength.java:

package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFTestLength extends UDF {
  public Integer evaluate(String s) {
    if (s == null) {
      return null;
    }
    return s.length();
  }
}
Storage Extensibility
• Input/OutputFormat: file formats
– SequenceFile, RCFile, TextFile, …
• SerDe: row formats
– Thrift, JSON, ProtocolBuffer, …
• Storage Handlers (new in 0.6)
– Integrate foreign metadata, e.g. HBase
• Indexing
– Under development in 0.7
Release 0.6
• October 2010
– Views
– Multiple Databases
– Dynamic Partitioning
– Automatic Merge
– New Join Strategies
– Storage Handlers
Dynamic Partitions
Automatically create partitions based on distinct values in columns
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
FROM page_view_stg pvs
Automatic merge
• Jobs can produce many files
• Why is this bad?
– Namenode pressure
– Downstream jobs have to deal with file processing overhead
• So, clean up by merging results into a few large files (configurable)
– Use conditional map-only task to do this
Join Strategies
• Old Join Strategies
– Map-reduce and Map Join
• Bucketed map-join
– Allows "small" table to be much bigger
• Sort Merge Map Join
• Deal with skew in map/reduce join
– Conditional plan step for skewed keys
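A map join, the strategy these variants build on, can be sketched in Python as follows. This is an illustrative model (invented function and column names), not Hive's implementation: the small table is loaded into an in-memory hash table, so the large table can be joined entirely in mappers with no shuffle or reduce phase.

```python
def map_join(big_rows, small_rows, key):
    # build an in-memory hash table from the small table
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # stream the big table (in Hive, inside each mapper) and probe the table
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**row, **match}

statuses = [{"userid": 1, "status": "hi"}, {"userid": 2, "status": "yo"}]
profiles = [{"userid": 1, "gender": "f"}]
joined = list(map_join(statuses, profiles, "userid"))
# joined == [{"userid": 1, "status": "hi", "gender": "f"}]
```

The whole strategy hinges on the hash table fitting in mapper memory, which is why the slides emphasize bucketing (probe only one bucket) and falling back to a reduce-side join when the small table is too big.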
Storage Handler Syntax
• HBase Example

CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list");
Release 0.7
• Deployed in Facebook
– Stats Functions
– Indexes
– Local Mode
– Automatic Map Join
– Multiple DISTINCTs
– Archiving
• In development
– Concurrency Control
– Stats Collection
– J/ODBC Enhancements
– Authorization
– RCFile2
– Partitioned Views
– Security Enhancements
Statistical Functions
• Stats 101
– stddev, var, covar
– percentile_approx
• Data Mining
– ngrams, sentences (text analysis)
– histogram_numeric
• SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
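histogram_numeric builds an approximate histogram in one streaming pass. The Python sketch below shows the core idea under that assumption (function name and bin-merging details are invented for illustration; Hive's actual algorithm differs): keep at most nbins (center, count) pairs, and when a new value overflows the budget, merge the two closest bins.

```python
def streaming_histogram(values, nbins):
    bins = []  # sorted list of [center, count] pairs
    for v in values:
        bins.append([float(v), 1])
        bins.sort(key=lambda b: b[0])
        if len(bins) > nbins:
            # merge the adjacent pair of bins with the smallest gap
            i = min(range(len(bins) - 1),
                    key=lambda j: bins[j + 1][0] - bins[j][0])
            c1, n1 = bins[i]
            c2, n2 = bins[i + 1]
            merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
            bins[i:i + 2] = [merged]
    return bins

hist = streaming_histogram([1, 1, 2, 10, 11], nbins=2)
# two bins: one centered near 1-2 (count 3), one near 10-11 (count 2)
```

Because it never holds more than nbins pairs, the aggregate runs in bounded memory per group, which is what makes it usable at the data scales on the earlier slides.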
Histogram query results
• "It's complicated" peaks at 18-19, but lasts into late 40s!
• "In a relationship" peaks at 20
• "Engaged" peaks at 25
• "Married" peaks in early 30s
• More married than single at 28
• Only teenagers use "widowed"?
Pluggable Indexing
• Reference implementation
– Index is stored in a normal Hive table
– Compact: distinct block addresses
– Partition-level rebuild
• Currently in R&D
– Automatic use for WHERE, GROUP BY
– New index types (e.g. bitmap, HBase)
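Taking the slide's description of the compact index at face value (value → distinct block addresses, rather than one entry per row), a toy sketch of building one looks like this. The function and the fixed-size "blocks" are assumptions for illustration, not Hive's index code.

```python
def build_compact_index(key_column, block_size):
    # map each key value to the distinct block ids that contain it;
    # a repeated value inside one block costs only a single entry
    index = {}
    for pos, key in enumerate(key_column):
        index.setdefault(key, set()).add(pos // block_size)
    return index

idx = build_compact_index(["a", "b", "a", "a", "c", "b"], block_size=2)
# idx["a"] == {0, 1}: "a" occurs three times but in only two blocks
```

A query with WHERE key = 'a' could then read just those blocks instead of scanning the partition, which is the payoff the "automatic use for WHERE" work aims at.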
Local Mode Execution
• Avoids map/reduce cluster job latency
• Good for jobs which process small amounts of data
• Let Hive decide when to use it
– set hive.exec.mode.local.auto=true;
• Or force its usage
– set mapred.job.tracker=local;
Automatic Map Join
• Map-Join if small table fits in memory
– If it can't, fall back to reduce join
• Optimize hash table data structures
• Use distributed cache to push out pre-filtered lookup table
– Avoid swamping HDFS with reads from thousands of mappers
Multiple DISTINCT Aggs
• Example
SELECT
view_date,
COUNT(DISTINCT userid),
COUNT(DISTINCT page_url)
FROM page_views
GROUP BY view_date
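Conceptually, the query above needs one set of seen values per DISTINCT expression per group. The Python sketch below is only an illustration of that semantics (invented function name; it is not how Hive plans the query, and holding full sets in memory does not scale the way Hive's execution does):

```python
from collections import defaultdict

def multi_distinct(rows, group_key, distinct_cols):
    # per group key, one set of seen values per DISTINCT column
    sets = defaultdict(lambda: [set() for _ in distinct_cols])
    for row in rows:
        for seen, col in zip(sets[row[group_key]], distinct_cols):
            seen.add(row[col])
    return {k: [len(seen) for seen in v] for k, v in sets.items()}

page_views = [
    {"view_date": "d1", "userid": 1, "page_url": "/a"},
    {"view_date": "d1", "userid": 1, "page_url": "/b"},
    {"view_date": "d1", "userid": 2, "page_url": "/a"},
]
counts = multi_distinct(page_views, "view_date", ["userid", "page_url"])
# counts == {"d1": [2, 2]}
```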
Archiving
• Use HAR (Hadoop archive format) to combine many files into a few
• Relieves namenode memory

ALTER TABLE page_views
{ARCHIVE|UNARCHIVE}
PARTITION (ds='2010-10-30')
Concurrency Control
• Pluggable distributed lock manager
– Default is Zookeeper-based
• Simple read/write locking
• Table-level and partition-level
• Implicit locking (statement level)
– Deadlock-free via lock ordering
• Explicit LOCK TABLE (global)
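The lock-ordering principle named above can be shown in a few lines. This is a sketch of the general technique, not Hive's ZooKeeper-based manager (lock names and helper functions are invented): if every statement acquires its locks in one global sort order, two statements can never each hold a lock the other is waiting for, so no wait cycle can form.

```python
import threading

# hypothetical lock registry: one lock per table/partition entity
locks = {name: threading.Lock() for name in ["T", "T/ds=2010-10-30"]}

def acquire_all(names):
    acquired = []
    for name in sorted(names):   # global order: sorted entity names
        locks[name].acquire()
        acquired.append(name)
    return acquired

def release_all(acquired):
    for name in reversed(acquired):
        locks[name].release()

# regardless of the order a statement lists its entities,
# the locks are taken in the same global order
held = acquire_all(["T/ds=2010-10-30", "T"])
release_all(held)
```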
Statistics Collection
• Implicit metastore update during load
– Or explicit via ANALYZE TABLE
• Table/partition-level
– Number of rows
– Number of files
– Size in bytes
Hive is now a TLP
• PMC
– Namit Jain (chair)
– John Sichi
– Zheng Shao
– Edward Capriolo
– Raghotham Murthy
• Committers
– Amareshwari Sriramadasu
– Carl Steinbach
– Paul Yang
– He Yongqiang
– Prasad Chakka
– Joydeep Sen Sarma
– Ashish Thusoo
– Ning Zhang
Developer Diversity
• Recent Contributors
– Facebook, Yahoo, Cloudera
– Netflix, Amazon, Media6Degrees, Intuit, Persistent Systems
– Numerous research projects
– Many many more…
• Monthly San Francisco bay area contributor meetups
• India meetups?
Roadmap: Heavy-Duty Tests
• Unit tests are insufficient
• What is needed:
– Real-world schemas/queries
– Non-toy data scales
– Scripted setup; configuration matrix
– Correctness/performance verification
– Automatic reports: throughput, latency, profiles, coverage, perf counters…
Roadmap: Shared Test Site
• Nightly runs, regression alerting
• Performance trending
• Synthetic workload (e.g. TPC-H)
• Real-world workload (anonymized?)
• This is critical for
– Non-subjective commit criteria
– Release quality
Roadmap: New Features
• Hive Server Stability/Deployment
• File Concatenation
– Reduce Number of Files
• Performance
– Bloom Filters
– Push Down Filters
• Cost Based Optimizer
– Column Level Statistics
– Plan should be based on Statistics
Resources
• http://hive.apache.org
• user/[email protected]
• [email protected]
• Questions?