HiveServer2 Oct., 2013 Schubert Zhang

Upload: schubert-zhang

Post on 29-Nov-2014

Page 1: HiveServer2

HiveServer2, Oct. 2013

Schubert Zhang

Page 2: HiveServer2

Hive Evolution

• Original
  • Let users express their queries in a high-level language without having to write MapReduce programs.
  • Mainly targeted at ad-hoc queries.
  • As a data tool, usually used in CLI mode.
• Now, more …
  • A parallel SQL DBMS that happens to use Hadoop for its storage and execution layers.
  • Ad-hoc + regular queries
  • As a service …

Page 3: HiveServer2

Introduction

• Limitations of HiveServer1
  • Concurrency
  • Security
  • Client Interface
  • Stability
• Sessions/Concurrency
  • The old Thrift API and server implementation didn’t support concurrency.
• xDBC
  • The old Thrift API didn’t support common xDBC (JDBC/ODBC).
• Authentication/Authorization
  • Incomplete implementations
• Auditing/Logging

HiveServer2:
• From hive-0.11 / CDH4.1
• Reconstructed and re-implemented. (HIVE-2935)

• HiveServer2 is a container for the Hive execution engine (Driver).

• For each client connection, it creates a new execution context (Connection and Session) that serves Hive SQL requests from the client.

• The new RPC interface enables the server to associate this Hive execution context with the thread serving the client’s request.

Page 4: HiveServer2

Architecture

[Figures: System Arch. and Authentication Arch. (not discussed here), @Cloudera. Note: in fact, the Driver lives in the Operation context.]

Source: http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/

Page 5: HiveServer2

Architecture: Internal

• Core Contexts
  • Connections
  • Sessions
  • Operations
• Operation Path …

[Figure: internal architecture diagram. Labels recovered from the diagram:]
• Client-1, Client-2, … call the server through the Thrift RPC Iface.
• hiveServer2 (main entry): starts the services below.
• thriftCLIService (TThreadPoolServer, implements the client RPC Iface): listen() and accept() new client connections and process each in its own thread (threads for client connections). Calls cliService through the ICLIService internal interface.
• cliService (real implementations of the various operations): open/close sessions and run operations in existing sessions, through the HiveSession interface.
• sessionManager (handleToSessionMap): session, session, ... session; each session has its own HiveConf and SessionState.
• operationManager (handleToOperationMap): create and run operations (op, op, ... op).
  • SQLOp/SetOp/DfsOp/AddResourceOp/DeleteResourceOp .. GetTypeInfoOp/GetCatalogsOp/GetSchemasOp/GetTablesOp/GetTableTypesOp/GetColumnsOp/GetFunctionsOp ...
  • SQLOp runs sync or async (runAsync).
• backgroundOperationPool: threads for async operations …
• Hive Driver: a SQLOp creates and runs a Hive Driver.

Page 6: HiveServer2

Architecture: Server Context

• Client
  • Connection (Thread)
    • Session (-> HiveConf, SessionState)
      • Operation (-> Driver)
• Usually, a client opens only one Session per Connection. (Refer to the JDBC HiveDriver: HiveConnection.)

[Figure: server context example. Labels: Client-1 with Connection-1 (Thread); Client-2 with Connection-2 (Thread); Session-11 and Session-12; Op-121 (SQL), Op-122 and Op-123 (SQL); one Driver per SQL operation.]
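The nesting on this slide (sessions identified by handles, operations owned by a session) can be sketched in a few lines of plain Java. This is only an illustrative model under stated assumptions: `ServerContextSketch` and its methods are hypothetical names, not the Hive classes, and a statement string stands in for a real Operation with its Driver.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical, simplified model of the server contexts; not the Hive implementation.
class ServerContextSketch {
    // session handle -> handles of the operations opened in that session
    private final Map<UUID, Set<UUID>> handleToSession = new HashMap<>();
    // operation handle -> statement text (stands in for an Operation and its Driver)
    private final Map<UUID, String> handleToOperation = new HashMap<>();

    synchronized UUID openSession() {
        UUID sessionHandle = UUID.randomUUID();
        handleToSession.put(sessionHandle, new HashSet<>());
        return sessionHandle;
    }

    synchronized UUID executeStatement(UUID sessionHandle, String statement) {
        Set<UUID> ops = handleToSession.get(sessionHandle);
        if (ops == null) {
            throw new IllegalArgumentException("unknown session: " + sessionHandle);
        }
        UUID opHandle = UUID.randomUUID();
        handleToOperation.put(opHandle, statement);
        ops.add(opHandle);
        return opHandle;
    }

    // Closing a session also removes every operation that belongs to it.
    synchronized void closeSession(UUID sessionHandle) {
        Set<UUID> ops = handleToSession.remove(sessionHandle);
        if (ops == null) {
            return;
        }
        for (UUID opHandle : ops) {
            handleToOperation.remove(opHandle);
        }
    }

    synchronized int openSessionCount() {
        return handleToSession.size();
    }

    synchronized int openOperationCount() {
        return handleToOperation.size();
    }
}
```

Keeping both maps keyed by opaque handles means a client never holds a server object, only an identifier that the server can validate on every call.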

Page 7: HiveServer2

New Client API

• TCLIService.thrift
• Complete API
  • Complete Database API
    • Think about JDBC/ODBC
    • To be compatible with existing DB software
  • Hive Specific API
• Best Practice
  • Client API vs. Internal API
  • Converting and Isolation

Session
• OpenSession: The client requests a new session. A new HiveSession is created in the server and a unique SessionHandler (UUID) is returned. All other calls depend on this session.
• CloseSession: The client requests to close the session. This also closes and removes all operations in the session.

SQL and Hive Operation
• ExecuteStatement: Execute an HQL statement.
  • SQLOp: Some SQL statements can be tagged “runAsync”; they are then executed in a dedicated thread and the call returns immediately.
  • Hive Command Operation: SetOp, DfsOp, AddResourceOp, DeleteResourceOp

DB Metadata Operation
• GetInfo *: Get various global variables of Hive. (Key-Type -> Value)
• GetTypeInfo: Get the detailed description and constraints of the data types.
• GetCatalogs: Does nothing so far.
• GetSchemas: Get schemas from the metastore.
• GetTables: Get table schemas from the metastore.
• GetTableTypes: Get the table types, e.g. MANAGED_TABLE, EXTERNAL_TABLE, VIRTUAL_VIEW, INDEX_TABLE.
• GetColumns: Get the columns of a table from the metastore.
• GetFunctions: Get the UDFs.

Operation for Operation
• GetOperationStatus: Get the state of an operation by opHandler: INITIALIZED/RUNNING/FINISHED/CANCELED/CLOSED/ERROR/UNKNOWN/PENDING.
• CancelOperation: Cancel a RUNNING or PENDING operation by opHandler. For SQLOp, do cleanup: close and destroy the Hive Driver, delete temp output files, and cancel the task running in the background thread…
• CloseOperation: Remove this operation and close it: for SQLOp, do cleanup; for HiveCommandOp, tearDownSessionIO.

Get Result
• GetResultSetMetadata: Get the result set’s schema, such as the column titles.
• FetchResults: Fetch the result rows from the real result set.
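The operation states and the CancelOperation rule listed on this slide can be sketched as a small enum. The state names come from the slide; the type name `OpStateSketch` is hypothetical, and the behavior for non-cancellable states (leave the state unchanged) is an assumption made for illustration, not something the slide specifies.

```java
// Hypothetical sketch: state names are from the slide, transition rules are assumptions.
enum OpStateSketch {
    INITIALIZED, PENDING, RUNNING, FINISHED, CANCELED, CLOSED, ERROR, UNKNOWN;

    // Per the slide, only RUNNING or PENDING operations are cancelled.
    boolean isCancellable() {
        return this == RUNNING || this == PENDING;
    }

    // Assumption: cancelling an operation in any other state leaves it unchanged.
    OpStateSketch cancel() {
        return isCancellable() ? CANCELED : this;
    }
}
```

A client that runs a statement asynchronously would typically poll GetOperationStatus until the state leaves PENDING/RUNNING, then call FetchResults and CloseOperation.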

Page 8: HiveServer2

Code

• Packages
  • org.apache.hive.service …, top project of apache…
• Pros
  • Clear implementation
  • Decoupling of HiveServer2 and HiveCore
  • Decoupling of the Thrift Client API and internal code
• Cons
  • Too many design patterns.
  • In some places, inconsistent principles.
  • The decoupling of HiveServer2 and HiveCore is still incomplete.
  • The JDBC Driver package/jar still relies on much other core code, such as Hive -> Hadoop and their libs … (maybe because of the support for Embedded Mode).

Page 9: HiveServer2

Code

[Class diagram, labels recovered as a list:]
• HiveServer2
  • +main(): entry point
• CompositeService
  • +serviceList
  • +addService(), +removeService()
• AbstractService
  • +HiveConf: global, set by init()
• Service
  • +state
  • +init(), +start(), +stop(), +register(): StateChangeListener
• CLIService
  • +sessionManager
• ThriftBinaryCLIService, ThriftCLIService
  • +cliService
  • TThreadPoolServer
• TCLIService.Iface
  • +OpenSession(), +CloseSession(), +GetInfo(), +ExecuteStatement(), +...(), +FetchResults()
• ICLIService
  • +openSession(), +closeSession(), +getInfo(), +executeStatement(), +...(), +fetchResults()
• SessionManager
  • +handleToSession: HashMap, +operationManager, +backgroundOperationPool (FixedThreadPool)
  • +openSession(), +closeSession(), +getSession(), +...(), +submitBackgroundOperation()
• OperationManager
  • +handleToOperation: HashMap
  • +newExecuteStatementOperation(), +newGetTypeInfoOperation(), +...(), +addOperation(), +removeOperation(), +getOperation(), +getOperationState(), +cancelOperation(), +closeOperation(), +getOperationNextRowSet(), +...()
• HiveSession, HiveSessionImpl
  • +sessionHandle, +hiveConf: new for each, +sessionState: new for each, +opHandleSet
  • +getSessionHandle(), +getInfo(), +executeStatement(), +executeStatementAsync(), +...(), +fetchResults()
• Operation
  • +opHandle, +parentSession, +state
  • +getState(), +setState(), +run(), +getNextRowSet(), +close(), +cancel(), +...()
• ExecuteStatementOperation: SQLOperation, AddResourceOperation, DeleteResourceOperation, DfsOperation, SetOperation
• GetSchemasOperation, GetInfoOperation, XXXOperation

This is just a quick view; it may not be exact in every detail, and it intentionally omits some less important things.
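The Service / CompositeService labels in the diagram suggest a composite lifecycle: a parent service forwards init(), start() and stop() to the services in its serviceList. A minimal sketch of that pattern follows. All type names here are hypothetical stand-ins, and stopping children in reverse order is an assumption chosen for the illustration rather than something the slide states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Service interface in the diagram.
interface ServiceSketch {
    void init();
    void start();
    void stop();
}

// Forwards lifecycle calls to its children, in the spirit of CompositeService.
class CompositeServiceSketch implements ServiceSketch {
    private final List<ServiceSketch> serviceList = new ArrayList<>();

    void addService(ServiceSketch service) {
        serviceList.add(service);
    }

    @Override
    public void init() {
        for (ServiceSketch s : serviceList) s.init();
    }

    @Override
    public void start() {
        for (ServiceSketch s : serviceList) s.start();
    }

    // Assumption: children are stopped in reverse order of registration.
    @Override
    public void stop() {
        for (int i = serviceList.size() - 1; i >= 0; i--) serviceList.get(i).stop();
    }
}

// A leaf service that only records which lifecycle calls it received.
class NamedServiceSketch implements ServiceSketch {
    private final String name;
    private final List<String> log;

    NamedServiceSketch(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override public void init()  { log.add("init " + name); }
    @Override public void start() { log.add("start " + name); }
    @Override public void stop()  { log.add("stop " + name); }
}
```

With this shape, the main entry only has to register the CLI service and the Thrift front end once, then drive both through a single init/start/stop sequence.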

Page 10: HiveServer2

HiveCore and Depending

• HiveConf
  • Global instance
  • Instance for each Session.
    • The client can inject additional key-value style configurations at OpenSession.
    • Set an explicit session name (id) to control the download directory name.
• Hive SessionState
  • Instance for each Session.
• Hive Driver
  • Instance for each SQL Operation.
• Global static variables?
  • ??

Env.?
• SetOperation -> SetProcessor
  • set env: variables cannot be set.
  • set system: global, System.getProperties().setProperty(..)
    • Should we forbid system settings? Or allow only an administrator to change them?
  • set hiveconf: instanced.
  • set hivevar: instanced.
  • set: instanced.
• AddResourceOperation and DeleteResourceOperation
  • SessionState.add_resource/delete_resource
  • DOWNLOADED_RESOURCES_DIR("hive.downloaded.resources.dir", System.getProperty("java.io.tmpdir") + File.separator + "${hive.session.id}_resources")
• DfsOperation
  • Auth. with HDFS?
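To make the per-session resource directory concrete, the default value quoted above can be evaluated with plain Java. The default template expression is taken from the slide; `ResourceDirSketch` and its `resolve` helper are hypothetical and only illustrate how a session id would replace the `${hive.session.id}` placeholder.

```java
import java.io.File;

// Hypothetical helper; only the default template expression comes from the slide.
class ResourceDirSketch {
    // Default value of hive.downloaded.resources.dir as quoted on the slide.
    static String defaultTemplate() {
        return System.getProperty("java.io.tmpdir") + File.separator
                + "${hive.session.id}_resources";
    }

    // Substitute the session id placeholder (String.replace matches literally, no regex).
    static String resolve(String template, String sessionId) {
        return template.replace("${hive.session.id}", sessionId);
    }

    public static void main(String[] args) {
        System.out.println(resolve(defaultTemplate(), "my-session"));
    }
}
```

This is why an explicit session name (id) controls the download directory name: each session resolves the same template to its own directory.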

Page 11: HiveServer2

Handler (Identifier)

• SessionHandler
• OperationHandler
• Use UUID

Thrift IDL:

struct THandleIdentifier {
  // 16 byte globally unique identifier
  // This is the public ID of the handle and
  // can be used for reporting.
  1: required binary guid,

  // 16 byte secret generated by the server
  // and used to verify that the handle is not
  // being hijacked by another user.
  2: required binary secret,
}

For now, only the public ID is used, which is OK.
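A 16-byte guid maps naturally onto a java.util.UUID, which is 128 bits. The sketch below shows one way to pack a UUID into 16 bytes and back. `HandleIdentifierSketch` is a hypothetical class, and generating the secret with SecureRandom is an assumption for illustration; the slide does not say how the server produces it.

```java
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.UUID;

// Hypothetical sketch of a handle identifier with a 16-byte guid and a 16-byte secret.
class HandleIdentifierSketch {
    final byte[] guid;   // public ID, usable for reporting
    final byte[] secret; // assumption: random bytes from SecureRandom

    HandleIdentifierSketch() {
        guid = toBytes(UUID.randomUUID());
        secret = new byte[16];
        new SecureRandom().nextBytes(secret);
    }

    // Pack the two 64-bit halves of a UUID into 16 big-endian bytes.
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Inverse of toBytes: read the halves back in the same order.
    static UUID toUuid(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }
}
```

If the secret were checked on every call, a client that merely guessed or observed a guid could not reuse someone else's session or operation handle.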

Page 12: HiveServer2

Configurations and Run

Config:
• hive.server2.transport.mode = binary | http | https
• hive.server2.thrift.port = 10000
• hive.server2.thrift.bind.host
• hive.server2.thrift.min.worker.threads = 5
• hive.server2.thrift.max.worker.threads = 500
• hive.server2.async.exec.threads = 50
• hive.server2.async.exec.shutdown.timeout = 10 (seconds)
• hive.support.concurrency = true ???
• hive.zookeeper.quorum =
• …
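The two async settings above describe a fixed-size background pool and a bounded wait at shutdown. A rough sketch with java.util.concurrent is shown below. `AsyncExecSketch` and its methods are hypothetical, the constants simply mirror the values on this slide, and the task body is a placeholder rather than a real query.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a background pool for async operations.
class AsyncExecSketch {
    static final int ASYNC_EXEC_THREADS = 50;        // hive.server2.async.exec.threads
    static final long SHUTDOWN_TIMEOUT_SECONDS = 10; // hive.server2.async.exec.shutdown.timeout

    private final ExecutorService backgroundOperationPool =
            Executors.newFixedThreadPool(ASYNC_EXEC_THREADS);

    // Returns immediately; the placeholder task runs on a pool thread.
    Future<String> submitBackgroundOperation(String statement) {
        Callable<String> task = () -> "done: " + statement;
        return backgroundOperationPool.submit(task);
    }

    // Stop accepting work, then wait up to the timeout for running tasks to finish.
    boolean stop() throws InterruptedException {
        backgroundOperationPool.shutdown();
        return backgroundOperationPool.awaitTermination(
                SHUTDOWN_TIMEOUT_SECONDS, TimeUnit.SECONDS);
    }
}
```

The pool size caps how many async statements run at once, independently of the Thrift worker threads that serve client connections.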

Run:
• Start HiveServer2
  • bin/hiveserver2 &
• Start CLI (use standard JDBC)
  • bin/beeline
  • !connect jdbc:hive2://localhost:10000
  • show tables;
  • select * from tablename limit 10;

Page 13: HiveServer2

Interface and Clients

• RPC (TCLIService.thrift)
  • Binary Protocol
  • Http/https Protocol (to be researched)
• New JDBC Driver
  • org.apache.hive.jdbc.HiveDriver
  • URL: jdbc:hive2://hostname:10000/dbname… (jdbc:hive2://localhost:10000/default)
  • Implemented more API features.

Third-party clients over JDBC:
• CLI
  • Beeline, based on SQLine
• IDE: SQuirreL SQL Client
• Web Client (e.g. H2 Web, etc.)
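Because the driver is standard JDBC, a client needs only java.sql plus the Hive JDBC jar on the classpath. The sketch below uses the driver class name and URL format from this slide. `JdbcClientSketch` and its methods are hypothetical, and `showTables` assumes a HiveServer2 instance is reachable at the given URL and accepts the connection without credentials.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical client sketch; needs the Hive JDBC driver jar and a running server to connect.
class JdbcClientSketch {
    // Builds a URL of the form jdbc:hive2://hostname:10000/dbname
    static String jdbcUrl(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db;
    }

    // Connects and prints the first column of "show tables".
    static void showTables(String url) throws SQLException, ClassNotFoundException {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("show tables")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

Usage would look like `JdbcClientSketch.showTables(JdbcClientSketch.jdbcUrl("localhost", 10000, "default"))`, which mirrors the Beeline `!connect` followed by `show tables;`.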

Page 14: HiveServer2

Client Tools: CLI

SQLine, Beeline

Page 15: HiveServer2

Client Tools: IDE SQuirreL SQL Client

Page 16: HiveServer2

Client Tools: Web Client

Page 17: HiveServer2

Think More …

• Thinking of XX as Platform
  • Standard JDBC/ODBC
  • RESTful API over HTTP, Web Service
    • AWS Redshift, SimpleDB …
• Hive as a Service?
  • http://www.qubole.com/
  • Request a cluster, run SQL ad hoc and regularly, workflow and scheduling.
• Language
  • SQL, R, Pig
• Computing of Estimation, Probability …

Page 18: HiveServer2

Thank You!