Hadoop, HBase and Hive - Bay Area Hadoop User Group

Hive/HBase Integration or, MaybeSQL?
April 2010
John Sichi, Facebook


Page 1

Hive/HBase Integration
or, MaybeSQL?
April 2010

John Sichi

Facebook

Page 2

Agenda

» Use Cases

» Architecture

» Storage Handler

» Load via INSERT

» Query Processing

» Bulk Load

» Q & A

Page 3

Motivations

» Data, data, and more data
  › 200 GB/day in March 2008 -> 12+ TB/day at the end of 2009
  › About 8x increase per year

» Queries, queries, and more queries
  › More than 200 unique users querying per day
  › 7500+ queries on the production cluster per day; a mixture of ad-hoc queries and ETL/reporting queries

» They want it all and they want it now
  › Users expect faster response time on fresher data
  › Sampled subsets aren't always good enough

Page 4

How Can HBase Help?

» Replicate dimension tables from transactional databases with low latency and without sharding
  › (Fact data can stay in Hive since it is append-only)

» Only move changed rows
  › "Full scrape" is too slow and doesn't scale as data keeps growing
  › Hive by itself is not good at row-level operations

» Integrate into Hive's map/reduce query execution plans for full parallel distributed processing

» Multiversioning for snapshot consistency?

Page 5

Use Case 1: HBase As ETL Data Target

[Diagram: Source Files/Tables → Hive INSERT … SELECT … → HBase]

Page 6

Use Case 2: HBase As Data Source

[Diagram: HBase + Other Files/Tables → Hive SELECT … JOIN … GROUP BY … → Query Result]

Page 7

Use Case 3: Low Latency Warehouse

[Diagram: Continuous Update → HBase; Periodic Load → Other Files/Tables; Hive Queries read from both]

Page 8

HBase Architecture

From http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

Page 9

Hive Architecture

Page 10

All Together Now!

Page 11

Hive CLI With HBase

» Minimum configuration needed:

hive \
  --auxpath hive_hbasehandler.jar,hbase.jar,zookeeper.jar \
  -hiveconf hbase.zookeeper.quorum=zk1,zk2…

hive> create table …

Page 12

Storage Handler

CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" =
  "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list"
);

Page 13

Column Mapping

» First column in the table is always the row key

» Other columns can be mapped to either:
  › An HBase column (any Hive type)
  › An HBase column family (must be MAP type in Hive)

» Multiple Hive columns can map to the same HBase column or family

» Limitations
  › Currently no control over type mapping (always string in HBase)
  › Currently no way to map the HBase timestamp attribute
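As a sketch of the family-to-MAP case above: a mapping entry consisting of just a family name (here the "small" family from the earlier users example) exposes the whole family as a Hive MAP, with qualifiers as keys. The table and column names are hypothetical.

```sql
-- Hypothetical sketch: map an entire HBase column family to a Hive MAP.
-- The first Hive column (userid) is still the row key; "small:" (family
-- name with no qualifier) pulls in every column of that family.
CREATE TABLE user_attrs(
  userid int,
  small map<string, string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:");
```

A query can then address individual qualifiers with map indexing, e.g. `SELECT small['email'] FROM user_attrs;`.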

Page 14

Load Via INSERT

INSERT OVERWRITE TABLE users
SELECT * FROM …;

» Hive tasks write rows to HBase via org.apache.hadoop.hbase.mapred.TableOutputFormat

» HBaseSerDe serializes rows into BatchUpdate objects (currently all values are converted to strings)

» Multiple rows with the same key -> only one row written

» Limitations
  › No write atomicity yet
  › No way to delete rows
  › Write parallelism is query-dependent (map vs reduce)

Page 15

Map-Reduce Job for INSERT

[Diagram: map-reduce job for INSERT, with reduce tasks writing to HBase]

From http://blog.maxgarfinkel.com/wp-uploads/2010/02/mapreduceDIagram.png

Page 16

Map-Only Job for INSERT

[Diagram: map-only job for INSERT, with map tasks writing directly to HBase]

Page 17

Query Processing

SELECT name, notes FROM users WHERE userid='xyz';

» Rows are read from HBase via org.apache.hadoop.hbase.mapred.TableInputFormatBase

» HBase determines the splits (one per table region)

» HBaseSerDe produces lazy rows/maps for RowResults

» Column selection is pushed down

» Any SQL can be used (join, aggregation, union…)

» Limitations
  › Currently no filter pushdown
  › How do we achieve locality?
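Since any SQL works against an HBase-backed table, a join-plus-aggregation sketch over the users table from slide 12 (the event_log table and its columns are hypothetical):

```sql
-- Hypothetical sketch: join the HBase-backed users table against an
-- ordinary Hive table and aggregate. This compiles to a normal Hive
-- map/reduce plan; only the scan of users goes through TableInputFormatBase.
SELECT u.name, COUNT(*) AS num_events
FROM users u
JOIN event_log e ON (u.userid = e.userid)
GROUP BY u.name;
```

Note that with no filter pushdown yet, a predicate on u.name would be applied in Hive after a full scan of the table's regions.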

Page 18

Metastore Integration

» DDL can be used to create metadata in Hive and HBase simultaneously and consistently

» CREATE EXTERNAL TABLE: register an existing HBase table

» DROP TABLE: will drop the HBase table too, unless it was created as EXTERNAL

» Limitations
  › No two-phase commit for DDL operations
  › ALTER TABLE is not yet implemented
  › Partitioning is not yet defined
  › No secondary indexing
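The EXTERNAL case above can be sketched by re-registering the user_list table from slide 12, assuming it already exists in HBase:

```sql
-- Hypothetical sketch: register a pre-existing HBase table without
-- taking ownership of it. DROP TABLE on users_ext would then remove
-- only the Hive metadata, leaving the HBase table "user_list" intact.
CREATE EXTERNAL TABLE users_ext(
  userid int, name string, email string, notes string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES ("hbase.table.name" = "user_list");
```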

Page 19

Bulk Load

Ideally…

SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT … ;

But for now, you have to do some work and issue multiple Hive commands:

1. Sample the source data for range partitioning
2. Save the sampling results to a file
3. Run a CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner (sorts the data, producing a large number of region files)
4. Import the HFiles into HBase
5. HBase can merge files if necessary

Page 20

Range Partitioning During Sort

[Diagram: TotalOrderPartitioner splits the sorted data into key ranges A-G, H-Q, R-Z (split points H and R); loadtable.rb imports the resulting HFiles into HBase]

Page 21

Sampling Query For Range Partitioning

Given 5 million users in a table bucketed into 1000 buckets of 5000 users each, pick 9 user_ids which partition the set of all user_ids into 10 nearly-equal-sized ranges.

select user_id from
  (select user_id
   from hive_user_table
   tablesample(bucket 1 out of 1000 on user_id) s
   order by user_id) sorted_user_5k_sample
where (row_sequence() % 501) = 0;
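row_sequence() is not a Hive builtin; it ships in hive-contrib and typically needs to be registered in the session before the sampling query runs (the jar path here is an assumption):

```sql
-- Register the row_sequence() UDF from hive-contrib so the sampling
-- query above can number the sorted rows. Adjust the jar path to your
-- installation; it is a placeholder here.
add jar /path/to/hive_contrib.jar;
create temporary function row_sequence as
  'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
```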

Page 22

Sorting Query For Bulk Load

set mapred.reduce.tasks=12;
set hive.mapred.partitioner=
  org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/hb_range_key_list;
set hfile.compression=gz;

create table hbsort(user_id string, user_type string, ...)
stored as
  inputformat 'org.apache.hadoop.mapred.TextInputFormat'
  outputformat 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
tblproperties ('hfile.family.path' = '/tmp/hbsort/cf');

insert overwrite table hbsort
select user_id, user_type, createtime, …
from hive_user_table
cluster by user_id;

Page 23

Deployment

» Latest Hive trunk (will be in Hive 0.6.0)

» Requires Hadoop 0.20+

» Tested with HBase 0.20.3 and Zookeeper 3.2.2

» 20-node hbtest cluster at Facebook

» No performance numbers yet
  › Currently setting up tests with about 6TB (gz compressed)

Page 24

Questions?

» [email protected]

» [email protected]

» http://wiki.apache.org/hadoop/Hive/HBaseIntegration

» http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

» Special thanks to Samuel Guo for the early versions of the integration code

Page 25

Hey, What About HBQL?

» HBQL focuses on providing a convenient language layer for managing and accessing individual HBase tables, and is not intended for heavy-duty SQL processing such as joins and aggregations

» HBQL is implemented via client-side calls, whereas Hive/HBase integration is implemented via map/reduce jobs
