May 2013 HUG: HCatalog/Hive Data Out

HCatalog/Hive DataOut

Bay Area Hadoop User Group Meetup

May 15, 2013

Moving Data Out of Hadoop Clusters Today

Yahoo! Presentation, Confidential

[Diagram: data paths out of the clusters, in three columns: Clients, Multi-tenant Hadoop Clusters, Managed Data-loading. Components: Client's Machine, HTTP Client, HTTP Server, Launcher/Gateway, HDFS Proxy (1), HTTP Proxy, Custom Proxy, Filers, HDFS, M/R on YARN, DistCp, SQLLDR. Links: Hadoop RPC, SSH, HTTPS.]

(1) Similar to HttpFS Gateway/Hoop in Hadoop 2.0 – Hadoop HDFS over HTTP

Typical Data Out Scenario


[Diagram labels: HDFS Proxy, HDFS]

§  Data (to be pulled out) is stored in a predefined directory structure as files

§  Client determines (through a custom interface) if a particular data feed of interest is committed or not

§  If committed, client gets the list of files first, and then pulls them out (file-by-file) through HDFSProxy

[Diagram labels: Custom Interface; Filer; cURL data copy; delimited files; Oracle DB; Ext. Table; Temp Table; Main Table; INSERT]
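The pull flow described in the bullets above can be sketched as follows. This is only an illustration: `FeedCatalog` and `FileProxy` are hypothetical stand-ins for the custom interface and for HDFSProxy, whose real APIs are not shown in the slides, and `FeedPuller` is a made-up name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the custom interface that reports feed status. */
interface FeedCatalog {
    boolean isCommitted(String feed);
}

/** Hypothetical stand-in for HDFSProxy: list a feed's files, open one for reading. */
interface FileProxy {
    List<String> listFiles(String feed) throws IOException;

    InputStream open(String path) throws IOException;
}

final class FeedPuller {
    /** Returns the local copies, or an empty list if the feed is not committed yet. */
    static List<Path> pull(String feed, FeedCatalog catalog, FileProxy proxy, Path targetDir)
            throws IOException {
        List<Path> copied = new ArrayList<>();
        if (!catalog.isCommitted(feed)) {
            return copied; // not committed: the caller has to poll again later
        }
        Files.createDirectories(targetDir);
        // Get the file list first, then pull the files one by one.
        for (String remote : proxy.listFiles(feed)) {
            String name = remote.substring(remote.lastIndexOf('/') + 1);
            Path local = targetDir.resolve(name);
            try (InputStream in = proxy.open(remote)) {
                Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
            }
            copied.add(local);
        }
        return copied;
    }
}
```

The empty-list return is where the polling burden shows up: the client has no way to learn that a feed was committed other than asking again.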

Pros and Cons of the Data Out Approach


Pros

§  Security of DB passwords – password not stored in the grid

§  Compression – cross-colo network bandwidth is expensive and compression is not possible with JDBC drivers

§  Encryption – data out of the grids has to be encrypted as it may be cross-colo

§  ACLs – DB hosts are not accessible from grid nodes, hence the need for the proxy

Cons

§  Directory structure – has to be predefined and known to downstream consumers of data

§  Data discovery – availability of data for consumption requires polling or other hooks

§  Overhead – Use of DONE files

§  Maintenance – Separate schema files and schema file formats

Introducing HCatalog and JMS notifications solves these problems

Hadoop – One Platform, Many Tools


[Diagram labels: Metastore; HDFS; Hive (Metastore Client, SerDe, InputFormat/OutputFormat); MapReduce (InputFormat/OutputFormat); Pig (Load/Store, HCatLoader/HCatStorer)]

Source: Alan Gates on HCatalog, Hadoop Summit, 2012

MapReduce/Pig

§  Pipelines

§  Iterative Processing

§  Research

Data Warehouse (Hive)

§  BI Tools

§  Analysis

HCatalog – Opening Up the Hive Metastore


[Diagram labels: Metastore; HDFS; Hive (Metastore Client, SerDe, InputFormat/OutputFormat); MapReduce and Pig (HCatInputFormat/HCatOutputFormat); External System (REST)]

HCatalog Value Proposition



§  Centralized metadata service for Hadoop

§  Facilitates interoperability among tools such as Pig, Hive, and M/R, and allows them to share data

§  Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution

§  Abstracts out the file storage format and data location
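As an illustration of an external system reading this metadata over REST, here is a minimal sketch. It assumes the REST interface is WebHCat (formerly Templeton) listening on its default port 50111; the host, database, table, and user names are made up, and the identifiers are assumed to be simple names that need no path escaping.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class HCatRestSketch {
    /** Builds the WebHCat "describe table" URI (assumed endpoint layout, see lead-in). */
    static URI describeTableUri(String host, int port, String db, String table, String user) {
        String path = "/templeton/v1/ddl/database/" + db + "/table/" + table;
        String query = "user.name=" + URLEncoder.encode(user, StandardCharsets.UTF_8);
        return URI.create("http://" + host + ":" + port + path + "?" + query);
    }

    /** Sends a GET request and returns the response body (JSON table description). */
    static String describeTable(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

A caller would pass the result of `describeTableUri("hcat.example.com", 50111, "default", "clicks", "alice")` to `describeTable`; the point is that the schema and location come from the metadata service, not from a separately maintained schema file.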

HiveServer2 with HCatalog


[Diagram labels: Data Out Client (ODBC/JDBC); HiveServer2 (ODBC/JDBC); HCatalog Server (Metastore); Messaging Service (ActiveMQ); HDFS; HiveServer2 Jobs; Hive Jobs (CLI); HCat Jobs (Pig, M/R); doAs(user); JMS notification (Producer); Notification (Consumer)]

Issues Solved


✔ Directory structure – has to be predefined and known to downstream consumers of data

✔ Data discovery – availability of data for consumption requires polling or other hooks

✔ Overhead – Use of DONE files

✔ Maintenance – Separate schema files and schema file formats

DataOut Motivation


§  Many ways to load and manage data on the grid

    §  HCatalog/Hive

    §  Pig

    §  Hadoop MR

    §  Sqoop

    §  GDM

§  Fewer ways of getting data off the cluster

    §  Sqoop

    §  HDFSProxy

    §  HDFS copy to local file system

    §  distcp between clusters

§  Challenges

    §  Underlying file format

    §  Size of data

    §  SLA

DataOut Overview


§  What is DataOut?

    §  Efficient method of moving data off the grid

    §  API exposes a programmatic interface

§  What are the advantages of DataOut?

    §  API based on well-known JDBC API

    §  Works with HCatalog/Hive

    §  Agnostic to the underlying storage format

    §  Parts of the whole data can be pulled in parallel

§  What are the limitations of DataOut?

    §  Queries must be SELECT * FROM type queries

DataOut Deployment


[Diagram labels: DataOut Client; Query Data; HS2, HS2, …, HS2, HS2 (multiple HiveServer2 instances); HDFS]

How DataOut Works


[Diagram: HiveServer2, one master (M), and three slaves (S), each paired with a HiveSplit and an FS/DB. Steps: Execute Query, Prepare Splits, Fetch Splits.]

Legend: M – Master, S – Slave, FS/DB – Filesystem/Database

Code to Prepare the HiveSplits


DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();
Statement s = c.createGenerateSplitStatement();
ResultSet rs = s.executeQuery(sql);
while (rs.next()) {
    HiveSplit split = (HiveSplit) rs.getObject(1);
    /* Launch job to fetch the split data. */
}
/* Synchronize on fetch jobs. */
rs.close();
s.close();
c.close();

Code to Retrieve the HiveSplits


DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();
PreparedStatement ps = c.prepareFetchSplitStatement(split);
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    /* Process row data. */
}
rs.close();
ps.close();
c.close();
/* Communicate with master process. */
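The "launch job" and "synchronize on fetch jobs" comments in the two snippets above leave the fan-out itself open. One way to picture it is the sketch below, which is a simplification: it uses threads inside a single JVM, whereas the slides describe separate slave processes that report back to a master. `SplitFanOut`, `fetchAll`, and the `fetchSplit` callback are hypothetical names; in a real client the callback would wrap the fetch-split code shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

final class SplitFanOut {
    /**
     * Runs fetchSplit once per split on a fixed-size pool and returns the
     * sum of the values it reports (for example, rows fetched per split).
     */
    static <S> long fetchAll(List<S> splits, ToLongFunction<S> fetchSplit, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // "Launch job to fetch the split data": one task per split.
            List<Future<Long>> pending = new ArrayList<>();
            for (S split : splits) {
                Callable<Long> task = () -> fetchSplit.applyAsLong(split);
                pending.add(pool.submit(task));
            }
            // "Synchronize on fetch jobs": block until every task has finished.
            long total = 0;
            for (Future<Long> result : pending) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each split is fetched over its own connection, the splits can also be spread across several HiveServer2 instances, which is what the deployment slide suggests.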

DataOut Demo


HS2 Performance – Single Client Connection


HS2 Performance – Five Concurrent Clients


HS2 Performance Summary


§  Throughput scales linearly

    §  Single client: 1GB: 60s, 5GB: 250s, 10GB: 500s

    §  Multiple clients: 1GB: 120s, 5GB: 600s, 10GB: 1200s

§  Throughput is affected by fetch size

    §  Sweet spot around ~200 rows

    §  Average row size may affect this number (pending further testing)

§  HiveServer2 is capable of handling multiple clients

    §  Throughput of 10GB in ~20 minutes with five client connections

    §  Drop-off in throughput is expected and reasonable

    §  5x increase in concurrent connections = 2x increase in transfer time

§  Goal of 50GB in 5min

    §  Achievable with ~10 HiveServer2 instances streaming data
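The instance count for the 50GB goal can be sanity-checked with a little arithmetic on the numbers above. The sketch below assumes that throughput adds up linearly across HiveServer2 instances and that each instance sustains the single-client rate measured at 10GB; neither assumption is stated in the slides, and `ThroughputEstimate` is a made-up name.

```java
final class ThroughputEstimate {
    /** Average transfer rate in GB per second. */
    static double rate(double gigabytes, double seconds) {
        return gigabytes / seconds;
    }

    /** Instances needed to hit the goal, assuming rates add up linearly. */
    static int instancesNeeded(double goalGb, double goalSeconds, double perInstanceRate) {
        double requiredRate = rate(goalGb, goalSeconds);
        return (int) Math.ceil(requiredRate / perInstanceRate);
    }

    public static void main(String[] args) {
        // Single client, 10GB in 500s (from the summary above).
        double singleClient = rate(10, 500);
        // Goal: 50GB in 5 minutes (300s).
        System.out.println(instancesNeeded(50, 300, singleClient));
    }
}
```

Under these assumptions the estimate comes out at 9 instances, in line with the ~10 quoted above. With plain JDBC, the fetch size mentioned in the summary is set through `Statement.setFetchSize(int)`.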
