May 2013 HUG: HCatalog/Hive Data Out

HCatalog/Hive DataOut, Bay Area Hadoop User Group Meetup, May 15, 2013

Upload: yahoo-developer-network

Post on 26-Jan-2015


DESCRIPTION

Yahoo!'s Hadoop grid uses a managed service to pull data into the clusters. Getting data out of the clusters, however, has been limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts away file locations and the underlying storage format of the data, and brings other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we focus on how the ODBC/JDBC interface of HiveServer2 serves the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about files, partitions, and their locations. We will also demo the data-out capabilities and go through other useful properties of the data-out feature.

Presenter(s): Sumeet Singh, Director, Product Management, Yahoo!; Chris Drome, Technical Yahoo!

TRANSCRIPT

Page 1: May 2013 HUG: HCatalog/Hive Data Out

HCatalog/Hive DataOut, Bay Area Hadoop User Group Meetup

May 15, 2013

Page 2: May 2013 HUG: HCatalog/Hive Data Out

Moving Data Out of Hadoop Clusters Today

2 Yahoo! Presentation, Confidential

[Diagram, three columns: Clients | Multi-tenant Hadoop Clusters | Managed Data-loading. Client-side components: client's machine, HTTP client, HTTP server, launcher/gateway, filers. Cluster-side components: HDFS Proxy (1), HTTP Proxy, custom proxy, M/R on YARN, HDFS. Connections are labeled SSH, HTTPS, and Hadoop RPC. Managed data-loading uses DistCp.]

(1) Similar to HttpFS Gateway/Hoop in Hadoop 2.0 – Hadoop HDFS over HTTP

Page 3: May 2013 HUG: HCatalog/Hive Data Out

Typical Data Out Scenario

§  Data (to be pulled out) is stored in a predefined directory structure as files

§  Client determines (through a custom interface) if a particular data feed of interest is committed or not

§  If committed, client gets the list of files first, and then pulls them out (file-by-file) through HDFSProxy

[Diagram: on the grid side, HDFS sits behind HDFS Proxy and a custom interface. Delimited files are copied with cURL to a filer, then loaded into an Oracle DB, either with SQLLDR into a temp table or through an external table, followed by an INSERT into the main table.]
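The three steps on this slide (check whether the feed is committed, list its files, pull them one by one) can be sketched as below. This is only an illustration: `FeedSource`, `InMemoryFeedSource`, and `FilePullClient` are hypothetical names, and the in-memory source stands in for the custom interface and HDFSProxy, whose real endpoints the slides do not show.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the custom interface plus HDFSProxy. */
interface FeedSource {
    boolean isCommitted(String feed);   // has the feed been fully written?
    List<String> listFiles(String feed); // files under the feed's directory
    byte[] fetch(String path);           // pull one file
}

/** In-memory source used only to make the sketch runnable. */
final class InMemoryFeedSource implements FeedSource {
    private final Map<String, String> files = new LinkedHashMap<>();
    private boolean committed;

    void addFile(String path, String content) { files.put(path, content); }
    void commit() { committed = true; }

    public boolean isCommitted(String feed) { return committed; }

    public List<String> listFiles(String feed) {
        List<String> result = new ArrayList<>();
        for (String path : files.keySet()) {
            if (path.startsWith(feed + "/")) {
                result.add(path);
            }
        }
        return result;
    }

    public byte[] fetch(String path) {
        return files.get(path).getBytes(StandardCharsets.UTF_8);
    }
}

final class FilePullClient {
    /** Returns the contents of every file in the feed, or nothing if uncommitted. */
    static List<String> pullFeed(FeedSource source, String feed) {
        if (!source.isCommitted(feed)) {
            return Collections.emptyList(); // caller has to poll again later
        }
        List<String> contents = new ArrayList<>();
        for (String path : source.listFiles(feed)) {
            // One request per file, as in the file-by-file pull on the slide.
            contents.add(new String(source.fetch(path), StandardCharsets.UTF_8));
        }
        return contents;
    }

    public static void main(String[] args) {
        InMemoryFeedSource source = new InMemoryFeedSource();
        source.addFile("feedA/part-0", "a|1");
        source.addFile("feedA/part-1", "b|2");
        source.commit();
        System.out.println(pullFeed(source, "feedA"));
    }
}
```

The client has to know the directory layout and has to keep polling until the feed is committed, which is the overhead the later slides set out to remove.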

Page 4: May 2013 HUG: HCatalog/Hive Data Out

Pros and Cons of the Data Out Approach


Pros

§  Security of DB passwords – password not stored in the grid

§  Compression – cross-colo network bandwidth is expensive and compression is not possible with JDBC drivers

§  Encryption – data out of the grids has to be encrypted as it may be cross-colo

§  ACLs – DB hosts are not accessible from grid nodes, and hence the proxy

Cons

§  Directory structure – has to be predefined and known to downstream consumers of data

§  Data discovery – availability of data for consumption requires polling or other hooks

§  Overhead – Use of DONE files

§  Maintenance – Separate schema files and schema file formats

The introduction of HCatalog and JMS notifications solves the problem

Page 5: May 2013 HUG: HCatalog/Hive Data Out

Hadoop – One Platform, Many Tools

[Diagram: Hive, MapReduce, and Pig on top of HDFS. Hive uses the Metastore through the Metastore Client and reads data through a SerDe and InputFormat/OutputFormat. MapReduce uses InputFormat/OutputFormat. Pig uses Load/Store.]

MapReduce/Pig
§  Pipelines
§  Iterative Processing
§  Research

Data Warehouse (Hive)
§  BI Tools
§  Analysis

Source: Alan Gates on HCatalog, Hadoop Summit, 2012

Page 6: May 2013 HUG: HCatalog/Hive Data Out

HCatalog – Opening Up the Hive Metastore

[Diagram: the same stack with HCatalog added. MapReduce uses HCatInputFormat/HCatOutputFormat and Pig uses HCatLoader/HCatStorer. Alongside Hive, they share the Metastore Client, SerDe, and InputFormat/OutputFormat on top of the Metastore and HDFS. An external system reaches the Metastore over REST.]

Source: Alan Gates on HCatalog, Hadoop Summit, 2012

Page 7: May 2013 HUG: HCatalog/Hive Data Out

HCatalog Value Proposition


Source: Alan Gates on HCatalog, Hadoop Summit, 2012

§  Centralized metadata service for Hadoop

§  Facilitates interoperability among tools such as Pig, Hive, M/R, allows for sharing of data

§  Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution

§  Abstracts out the file storage format and data location

Page 8: May 2013 HUG: HCatalog/Hive Data Out

HiveServer2 with HCatalog

[Diagram: a Data Out client connects to HiveServer2 over ODBC/JDBC. HiveServer2 jobs, Hive jobs (CLI), and HCat jobs (Pig, M/R) access HDFS via doAs(user) and use the HCatalog Server (Metastore). The HCatalog Server acts as JMS notification producer to a messaging service (ActiveMQ), and the Data Out client acts as notification consumer.]

Page 9: May 2013 HUG: HCatalog/Hive Data Out

Issues Solved


Directory structure – has to be predefined and known to downstream consumers of data

Data discovery – availability of data for consumption requires polling or other hooks

Overhead – Use of DONE files

Maintenance – Separate schema files and schema file formats

✔ ✔

Page 10: May 2013 HUG: HCatalog/Hive Data Out

DataOut Motivation

§  Many ways to load and manage data on the grid
    §  HCatalog/Hive
    §  Pig
    §  Hadoop MR
    §  Sqoop
    §  GDM

§  Fewer ways of getting data off the cluster
    §  Sqoop
    §  HDFSProxy
    §  HDFS copy to local file system
    §  distcp between clusters

§  Challenges
    §  Underlying file format
    §  Size of data
    §  SLA

Page 11: May 2013 HUG: HCatalog/Hive Data Out

DataOut Overview

§  What is DataOut?
    §  Efficient method of moving data off the grid
    §  API exposes a programmatic interface

§  What are the advantages of DataOut?
    §  API based on the well-known JDBC API
    §  Works with HCatalog/Hive
    §  Agnostic to the underlying storage format
    §  Parts of the whole data can be pulled in parallel

§  What are the limitations of DataOut?
    §  Queries must be SELECT * FROM type queries

Page 12: May 2013 HUG: HCatalog/Hive Data Out

DataOut Deployment

[Diagram: a DataOut client sends a query to a pool of HiveServer2 (HS2) instances and receives data back from them. The HS2 instances read from HDFS.]

Page 13: May 2013 HUG: HCatalog/Hive Data Out

How DataOut Works

[Diagram: the master (M) has HiveServer2 execute the query and prepare the splits. Three slaves (S) each fetch one HiveSplit and write it to a filesystem or database (FS/DB).]

Steps: Execute Query, Prepare Splits, Fetch Splits

Legend: M – Master, S – Slave, FS/DB – Filesystem/Database

Page 14: May 2013 HUG: HCatalog/Hive Data Out

Code to Prepare the HiveSplits

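// Master side. Statement and ResultSet are the java.sql types;
// DataOut, HiveConnection, and HiveSplit come from the DataOut client API.
// sql holds the SELECT * FROM query to export.
DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();

// Each row of the result carries one HiveSplit.
Statement s = c.createGenerateSplitStatement();
ResultSet rs = s.executeQuery(sql);

while (rs.next()) {
    HiveSplit split = (HiveSplit) rs.getObject(1);
    /* Launch job to fetch the split data. */
}

/* Synchronize on fetch jobs. */

rs.close();
s.close();
c.close();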

Page 15: May 2013 HUG: HCatalog/Hive Data Out

Code to Retrieve the HiveSplits

// Slave side. split is the HiveSplit handed over by the master process.
DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();

// Fetch only the rows that belong to this split.
PreparedStatement ps = c.prepareFetchSplitStatement(split);
ResultSet rs = ps.executeQuery();

while (rs.next()) {
    /* Process row data. */
}

rs.close();
ps.close();
c.close();

/* Communicate with master process. */
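The two snippets leave "launch job to fetch the split data" and "synchronize on fetch jobs" open. One way to fill those gaps, when the fetch jobs run as threads in the master's JVM, is a fixed thread pool. The sketch below is an assumption, not the DataOut implementation: `ParallelSplitFetch` and `fetchAll` are hypothetical names, the split type is generic so the code runs without the DataOut library, and the fetch function stands in for the slave-side loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelSplitFetch {
    /**
     * Runs one fetch job per split on a pool of workers and returns the
     * total number of rows once every job has finished.
     */
    static <S> long fetchAll(List<S> splits, Function<S, Long> fetchSplit, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Long>> jobs = new ArrayList<>();
            for (S split : splits) {
                jobs.add(() -> fetchSplit.apply(split)); // "launch job to fetch the split data"
            }
            long totalRows = 0;
            // invokeAll blocks until all jobs are done: "synchronize on fetch jobs".
            for (Future<Long> finished : pool.invokeAll(jobs)) {
                totalRows += finished.get();
            }
            return totalRows;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for fetch jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fetch job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in splits: each integer is the row count a real fetch would return.
        List<Integer> splits = Arrays.asList(100, 250, 50);
        long rows = fetchAll(splits, n -> n.longValue(), 3);
        System.out.println("rows fetched: " + rows);
    }
}
```

In a real client, the function passed to `fetchAll` would open its own connection, call `prepareFetchSplitStatement(split)`, and iterate the result set as in the slave-side snippet. Slaves running as separate processes or on other hosts would need a different coordination mechanism.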

Page 16: May 2013 HUG: HCatalog/Hive Data Out

DataOut Demo


Page 17: May 2013 HUG: HCatalog/Hive Data Out

HS2 Performance – Single Client Connection


Page 18: May 2013 HUG: HCatalog/Hive Data Out

HS2 Performance – Five Concurrent Clients


Page 19: May 2013 HUG: HCatalog/Hive Data Out

HS2 Performance Summary

§  Throughput scales linearly
    §  Single client: 1GB: 60s, 5GB: 250s, 10GB: 500s
    §  Multiple clients: 1GB: 120s, 5GB: 600s, 10GB: 1200s

§  Throughput is affected by fetch size
    §  Sweet spot around 200 rows
    §  Average row size may affect this number (pending further testing)

§  HiveServer2 is capable of handling multiple clients
    §  Throughput of 10GB in ~20 minutes with five client connections
    §  Drop-off in throughput is expected and reasonable
    §  5x increase in concurrent connections = 2x increase in transfer time

§  Goal of 50GB in 5min
    §  Achievable with ~10 HiveServer2 instances streaming data
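The "~10 instances" estimate can be checked against the single-client numbers with a little arithmetic. The sketch below assumes 1 GB = 1024 MB and that each HiveServer2 instance sustains the measured single-client rate independently; both are assumptions, and `DataOutThroughput` with its methods is a hypothetical helper, not part of DataOut.

```java
final class DataOutThroughput {
    /** Transfer rate in MB/s, assuming 1 GB = 1024 MB. */
    static double mbPerSecond(double gigabytes, double seconds) {
        return gigabytes * 1024.0 / seconds;
    }

    /** Instances needed to hit a target, if each one sustains the given rate. */
    static int instancesNeeded(double targetGb, double targetSeconds,
                               double perInstanceMbPerSecond) {
        double requiredRate = mbPerSecond(targetGb, targetSeconds);
        return (int) Math.ceil(requiredRate / perInstanceMbPerSecond);
    }

    public static void main(String[] args) {
        // Single-client measurement from the slide: 10GB in 500s.
        double singleClientRate = mbPerSecond(10, 500);
        // Goal from the slide: 50GB in 5 minutes (300s).
        double goalRate = mbPerSecond(50, 300);
        System.out.printf("single client: %.2f MB/s%n", singleClientRate);
        System.out.printf("goal:          %.2f MB/s%n", goalRate);
        System.out.println("instances at single-client rate: "
                + instancesNeeded(50, 300, singleClientRate));
    }
}
```

At the 10GB/500s rate (about 20 MB/s per instance) the goal of roughly 171 MB/s works out to 9 instances, and the slower 1GB/60s rate gives about 10, which is in line with the slide's estimate.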

Page 20: May 2013 HUG: HCatalog/Hive Data Out