Transcript
Page 1: May 2013 HUG: HCatalog/Hive Data Out

HCatalog/Hive DataOut

Bay Area Hadoop User Group Meetup

May 15, 2013

Page 2: May 2013 HUG: HCatalog/Hive Data Out

Moving Data Out of Hadoop Clusters Today

2 Yahoo! Presentation, Confidential

[Diagram with three groups: Clients, Multi-tenant Hadoop Clusters, Managed Data-loading. Components: Client’s Machine, HTTP Client, HTTP Server, Launcher/Gateway, HDFS Proxy¹, HTTP Proxy, Custom Proxy, Filers, DistCp, HDFS, M/R on YARN. Connection labels: SSH, HTTPS, Hadoop RPC.]

¹ Similar to HttpFS Gateway/Hoop in Hadoop 2.0 – Hadoop HDFS over HTTP

Page 3: May 2013 HUG: HCatalog/Hive Data Out

SQLLDR

Typical Data Out Scenario

HDFS Proxy HDFS

§  Data (to be pulled out) is stored in a predefined directory structure as files

§  Client determines (through a custom interface) if a particular data feed of interest is committed or not

§  If committed, client gets the list of files first, and then pulls them out (file-by-file) through HDFSProxy
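The three bullets above describe a check-then-pull loop. A minimal sketch of that flow follows. The `FeedStatus` and `FileProxy` interfaces are hypothetical stand-ins for the custom interface and for HDFSProxy; the slides do not show their real APIs, so every name and path here is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the custom interface that reports feed commits. */
interface FeedStatus {
    boolean isCommitted(String feed);
}

/** Hypothetical stand-in for HDFSProxy: list a feed directory, fetch one file. */
interface FileProxy {
    List<String> listFiles(String feedDirectory);
    byte[] fetch(String path);
}

class FeedPuller {
    private final FeedStatus status;
    private final FileProxy proxy;

    FeedPuller(FeedStatus status, FileProxy proxy) {
        this.status = status;
        this.proxy = proxy;
    }

    /** Pulls every file of a committed feed; returns the paths pulled (empty if not committed). */
    List<String> pull(String feed, String feedDirectory) {
        List<String> pulled = new ArrayList<>();
        if (!status.isCommitted(feed)) {
            return pulled; // not committed yet: the client has to check again later
        }
        for (String path : proxy.listFiles(feedDirectory)) {
            byte[] data = proxy.fetch(path); // one request per file
            // A real client would write `data` to the filer here.
            pulled.add(path);
        }
        return pulled;
    }

    public static void main(String[] args) {
        // In-memory proxy so the sketch runs without a cluster.
        FileProxy inMemory = new FileProxy() {
            public List<String> listFiles(String dir) {
                return Arrays.asList(dir + "/part-00000", dir + "/part-00001");
            }
            public byte[] fetch(String path) {
                return path.getBytes();
            }
        };
        FeedPuller puller = new FeedPuller(feed -> true, inMemory);
        System.out.println(puller.pull("example_feed", "/data/example_feed"));
    }
}
```

The client has to know the directory layout and has to keep asking whether the feed is committed, which is what the cons on the next slide refer to.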

[Diagram labels: Custom Interface, Filer, delimited files, cURL data copy, Oracle DB, Temp Table, Ext. Table, INSERT, Main Table]

Page 4: May 2013 HUG: HCatalog/Hive Data Out

Pros and Cons of the Data Out Approach

Pros

§  Security of DB passwords – password not stored in the grid

§  Compression – cross-colo network bandwidth is expensive and compression is not possible with JDBC drivers

§  Encryption – data out of the grids has to be encrypted as it may be cross-colo

§  ACLs – DB hosts are not accessible from grid nodes, and hence the proxy

Cons

§  Directory structure – has to be predefined and known to downstream consumers of data

§  Data discovery – availability of data for consumption requires polling or other hooks

§  Overhead – Use of DONE files

§  Maintenance – Separate schema files and schema file formats

The introduction of HCatalog and JMS notifications solves the problem

Page 5: May 2013 HUG: HCatalog/Hive Data Out

Hadoop – One Platform, Many Tools

Metastore HDFS

Hive

Metastore Client InputFormat/OutputFormat

SerDe

InputFormat/OutputFormat

MapReduce Pig

Load/Store

Source: Alan Gates on HCatalog, Hadoop Summit, 2012

MapReduce/Pig
§  Pipelines
§  Iterative Processing
§  Research

Data Warehouse Hive
§  BI Tools
§  Analysis

Page 6: May 2013 HUG: HCatalog/Hive Data Out

HCatLoader/ HCatStorer

HCatalog – Opening Up the Hive Metastore

Metastore HDFS

Metastore Client InputFormat/OutputFormat

SerDe

HCatInputFormat/HCatOutputFormat

MapReduce Pig

Source: Alan Gates on HCatalog, Hadoop Summit, 2012

Hive

REST

External System

Page 7: May 2013 HUG: HCatalog/Hive Data Out

HCatalog Value Proposition

Source: Alan Gates on HCatalog, Hadoop Summit, 2012

§  Centralized metadata service for Hadoop

§  Facilitates interoperability among tools such as Pig, Hive, and M/R, and allows for sharing of data

§  Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution

§  Abstracts out the file storage format and data location

Page 8: May 2013 HUG: HCatalog/Hive Data Out

HiveServer2 with HCatalog

HDFS

(ODBC)

HiveServer2 (ODBC/ JDBC)

Data Out Client

(JDBC)

HCatalog Server (Metastore)

Messaging Service

(ActiveMQ)

HiveServer2 Jobs

Hive Jobs (CLI)

HCat Jobs (Pig, M/R)

doAs(user)

doAs(user)

JMS notification (Producer)

Notification (Consumer)

Page 9: May 2013 HUG: HCatalog/Hive Data Out

Issues Solved

Directory structure – has to be predefined and known to downstream consumers of data

Data discovery – availability of data for consumption requires polling or other hooks

Overhead – Use of DONE files

Maintenance – Separate schema files and schema file formats

✔ ✔

Page 10: May 2013 HUG: HCatalog/Hive Data Out

DataOut Motivation

§  Many ways to load and manage data on the grid
    §  HCatalog/Hive
    §  Pig
    §  Hadoop MR
    §  Sqoop
    §  GDM

§  Fewer ways of getting data off the cluster
    §  Sqoop
    §  HDFSProxy
    §  HDFS copy to local file system
    §  distcp between clusters

§  Challenges
    §  Underlying file format
    §  Size of data
    §  SLA

Page 11: May 2013 HUG: HCatalog/Hive Data Out

DataOut Overview

§  What is DataOut?
    §  Efficient method of moving data off the grid
    §  API exposes a programmatic interface

§  What are the advantages of DataOut?
    §  API based on well-known JDBC API
    §  Works with HCatalog/Hive
    §  Agnostic to the underlying storage format
    §  Parts of the whole data can be pulled in parallel

§  What are the limitations of DataOut?
    §  Queries must be SELECT * FROM type queries

Page 12: May 2013 HUG: HCatalog/Hive Data Out

DataOut Deployment

HDFS

HS2 HS2 … HS2 HS2

DataOut Client

Query Data

Page 13: May 2013 HUG: HCatalog/Hive Data Out

How DataOut Works

[Diagram: a master (M) next to HiveServer2, labeled "Execute Query" and "Prepare Splits"; three slaves (S), each paired with one HiveSplit and one FS/DB, labeled "Fetch Splits".]

Legend: M – Master, S – Slave, FS/DB – Filesystem/Database
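The master/slave flow on this slide can be imitated without a cluster. The sketch below is not the DataOut API: `Split`, `prepareSplits`, and `fetchSplit` are hypothetical names, a split is assumed to be a plain row range, and the "slaves" are threads in a local pool. It only illustrates the pattern of preparing splits once, fetching them in parallel, and synchronizing on the fetch jobs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SplitFanOut {
    /** Hypothetical stand-in for a HiveSplit: a contiguous range of rows. */
    static final class Split {
        final int firstRow;
        final int rowCount;
        Split(int firstRow, int rowCount) {
            this.firstRow = firstRow;
            this.rowCount = rowCount;
        }
    }

    /** Master step: divide totalRows into at most `parts` non-empty splits. */
    static List<Split> prepareSplits(int totalRows, int parts) {
        List<Split> splits = new ArrayList<>();
        int base = totalRows / parts;
        int extra = totalRows % parts;
        int start = 0;
        for (int i = 0; i < parts; i++) {
            int size = base + (i < extra ? 1 : 0);
            if (size > 0) {
                splits.add(new Split(start, size));
            }
            start += size;
        }
        return splits;
    }

    /** Slave step: "fetch" one split; here it only counts the rows it would copy. */
    static int fetchSplit(Split split) {
        int rows = 0;
        for (int r = split.firstRow; r < split.firstRow + split.rowCount; r++) {
            rows++; // a real slave would write this row to a filesystem or database
        }
        return rows;
    }

    /** Master: launch one fetch job per split, then synchronize on all of them. */
    static int run(int totalRows, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<Integer>> jobs = new ArrayList<>();
            for (Split split : prepareSplits(totalRows, parts)) {
                Callable<Integer> job = () -> fetchSplit(split);
                jobs.add(pool.submit(job));
            }
            int total = 0;
            for (Future<Integer> job : jobs) {
                total += job.get(); // blocks until that fetch job finishes
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("fetch job failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("rows fetched: " + run(1000, 3));
    }
}
```

In the real deployment the slaves would be separate processes, each holding its own connection to a HiveServer2 instance.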

Page 14: May 2013 HUG: HCatalog/Hive Data Out

Code to Prepare the HiveSplits

import java.sql.ResultSet;
import java.sql.Statement;
/* DataOut, HiveConnection, and HiveSplit are DataOut API classes. */

DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();

Statement s = c.createGenerateSplitStatement();
ResultSet rs = s.executeQuery(sql);  /* sql holds the query text */

while (rs.next()) {
    HiveSplit split = (HiveSplit) rs.getObject(1);
    /* Launch job to fetch the split data. */
}

/* Synchronize on fetch jobs. */

rs.close();
s.close();
c.close();

Page 15: May 2013 HUG: HCatalog/Hive Data Out

Code to Retrieve the HiveSplits

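import java.sql.PreparedStatement;
import java.sql.ResultSet;
/* DataOut, HiveConnection, and HiveSplit are DataOut API classes. */

DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();

/* split is a HiveSplit produced by the prepare step. */
PreparedStatement ps = c.prepareFetchSplitStatement(split);
ResultSet rs = ps.executeQuery();

while (rs.next()) {
    /* Process row data. */
}

rs.close();
ps.close();
c.close();

/* Communicate with master process. */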

Page 16: May 2013 HUG: HCatalog/Hive Data Out

DataOut Demo

Page 17: May 2013 HUG: HCatalog/Hive Data Out

HS2 Performance – Single Client Connection

Page 18: May 2013 HUG: HCatalog/Hive Data Out

HS2 Performance – Five Concurrent Clients

Page 19: May 2013 HUG: HCatalog/Hive Data Out

HS2 Performance Summary

§  Throughput scales linearly
    §  Single client: 1GB: 60s, 5GB: 250s, 10GB: 500s
    §  Multiple clients: 1GB: 120s, 5GB: 600s, 10GB: 1200s

§  Throughput is affected by fetch size
    §  Sweet spot around ~200 rows
    §  Average row size may affect this number (pending further testing)

§  HiveServer2 is capable of handling multiple clients
    §  Throughput of 10GB in ~20 minutes with five client connections
    §  Drop-off in throughput is expected and reasonable
    §  5x increase in concurrent connections = 2x increase in transfer time

§  Goal of 50GB in 5min
    §  Achievable with ~10 HiveServer2 instances streaming data
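The "~10 instances" estimate can be checked with back-of-the-envelope arithmetic. The sketch below uses only the timings quoted on this slide and makes two assumptions that the slide does not state: 1 GB is taken as 1024 MB, and each HiveServer2 instance is assumed to sustain roughly the single-client rate. `ThroughputEstimate` and its methods are illustrative names.

```java
class ThroughputEstimate {
    static final double MB_PER_GB = 1024.0; // assumption: binary gigabytes

    /** Transfer rate in MB/s for a given volume and duration. */
    static double mbPerSecond(double gigabytes, double seconds) {
        return gigabytes * MB_PER_GB / seconds;
    }

    /** How many instance-equivalents are needed to reach a target rate. */
    static double instancesNeeded(double targetGb, double targetSeconds,
                                  double perInstanceMbPerSecond) {
        return mbPerSecond(targetGb, targetSeconds) / perInstanceMbPerSecond;
    }

    public static void main(String[] args) {
        double slowSingle = mbPerSecond(1, 60);    // single client, 1GB in 60s
        double fastSingle = mbPerSecond(10, 500);  // single client, 10GB in 500s
        double target = mbPerSecond(50, 5 * 60);   // goal: 50GB in 5min

        System.out.printf("single client: %.2f to %.2f MB/s%n", slowSingle, fastSingle);
        System.out.printf("target rate:   %.2f MB/s%n", target);
        System.out.printf("instances:     %.1f to %.1f%n",
                instancesNeeded(50, 300, fastSingle),
                instancesNeeded(50, 300, slowSingle));
    }
}
```

Under these assumptions a single client moves roughly 17 to 20 MB/s, the goal needs about 171 MB/s, and that works out to somewhere between 8 and 10 instances, which is consistent with the slide's figure.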
