HiveServer2
Oct. 2013
Schubert Zhang
Hive Evolution
• Original
  • Let users express their queries in a high-level language without having to write MapReduce programs.
  • Mainly targeted at ad-hoc queries.
  • As a data tool, usually used in CLI mode.
• Now more …
  • A parallel SQL DBMS that happens to use Hadoop for its storage and execution layers.
  • Ad-hoc + regular
  • As a service …
Introduction
• Limitations of HiveServer1
  • Concurrency
  • Security
  • Client Interface
  • Stability
• Sessions/Concurrency
  • The old Thrift API and server implementation didn’t support concurrency.
• xDBC
  • The old Thrift API didn’t support common xDBC (JDBC/ODBC).
• Authentication/Authorization
  • Incomplete implementations
• Auditing/Logging
HiveServer2:
• From hive-0.11 / CDH4.1
• Reconstructed and re-implemented. (HIVE-2935)
• HiveServer2 is a container for the Hive execution engine (Driver).
• For each client connection, it creates a new execution context (Connection and Session) that serves Hive SQL requests from the client.
• The new RPC interface enables the server to associate this Hive execution context with the thread serving the client’s request.
Architecture
• System architecture
• Authentication architecture (not covered here)
• Note: in fact, the Driver lives in the Operation context.
Figures: @Cloudera, http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/
Architecture: Internal
• Core Contexts
  • Connections
  • Sessions
  • Operations
• Operation Path …
[Diagram: internal components and call path]
• hiveServer2 (main entry): starts the services below.
• thriftCLIService (TThreadPoolServer, implements the client Thrift RPC Iface): listen() and accept() new client connections (Client-1, Client-2, …) and process each one in its own thread; calls cliService through the internal ICLIService interface.
• cliService: the real implementations of the various operations.
• sessionManager (handleToSessionMap): opens/closes sessions and runs operations in existing sessions through the HiveSession interface. Each session holds its own HiveConf and SessionState.
• operationManager (handleToOperationMap): creates and runs operations: SQLOp/SetOp/DfsOp/AddResourceOp/DeleteResourceOp .. GetTypeInfoOp/GetCatalogsOp/GetSchemasOp/GetTablesOp/GetTableTypesOp/GetColumnsOp/GetFunctionsOp ...
• backgroundOperationPool: threads for async operations (runAsync).
• SQLOp (sync/async): creates and runs the Hive Driver.
Architecture: Server Context
• Client
  • Connection (Thread)
    • Session (-> HiveConf, SessionState)
      • Operation (-> Driver)
• Usually, a client only opens one Session in a Connection. (refer to JDBC HiveDriver: HiveConnection)
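[Diagram: example context tree. Client-1 / Connection-1 (Thread) holds Session-11 and Session-12; Session-12 holds Op-121 (SQL), Op-122 and Op-123 (SQL); each SQL operation has its own Driver. Client-2 / Connection-2 (Thread) is shown alongside.]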
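The context tree above can be sketched as two handle maps, in the spirit of handleToSessionMap and handleToOperationMap. This is only a minimal illustration: the class names SessionSketch and ContextRegistrySketch, the String-based configuration, and the method names are assumptions made here, not Hive's actual classes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Hypothetical session: a private config copy plus the handles of its operations. */
final class SessionSketch {
    final Map<String, String> conf;               // stands in for a per-session HiveConf
    final Set<UUID> opHandles = new HashSet<>();

    SessionSketch(Map<String, String> globalConf, Map<String, String> overrides) {
        conf = new HashMap<>(globalConf);         // copy, so sessions never share state
        conf.putAll(overrides);                   // client-supplied key-value settings
    }
}

/** Hypothetical registry that keeps the session map and the operation map together. */
final class ContextRegistrySketch {
    private final Map<String, String> globalConf;
    private final Map<UUID, SessionSketch> handleToSession = new HashMap<>();
    private final Map<UUID, String> handleToOperation = new HashMap<>();

    ContextRegistrySketch(Map<String, String> globalConf) {
        this.globalConf = new HashMap<>(globalConf);
    }

    synchronized UUID openSession(Map<String, String> overrides) {
        UUID handle = UUID.randomUUID();
        handleToSession.put(handle, new SessionSketch(globalConf, overrides));
        return handle;
    }

    synchronized UUID executeStatement(UUID sessionHandle, String statement) {
        SessionSketch session = requireSession(sessionHandle);
        UUID opHandle = UUID.randomUUID();
        handleToOperation.put(opHandle, statement); // a real SQL operation would own a Driver
        session.opHandles.add(opHandle);
        return opHandle;
    }

    /** Closing a session also removes every operation that belongs to it. */
    synchronized void closeSession(UUID sessionHandle) {
        SessionSketch session = requireSession(sessionHandle);
        handleToOperation.keySet().removeAll(session.opHandles);
        handleToSession.remove(sessionHandle);
    }

    synchronized String sessionConf(UUID sessionHandle, String key) {
        return requireSession(sessionHandle).conf.get(key);
    }

    synchronized int operationCount() {
        return handleToOperation.size();
    }

    private SessionSketch requireSession(UUID handle) {
        SessionSketch session = handleToSession.get(handle);
        if (session == null) {
            throw new IllegalArgumentException("Invalid session handle: " + handle);
        }
        return session;
    }
}

final class ContextRegistryDemo {
    public static void main(String[] args) {
        ContextRegistrySketch registry = new ContextRegistrySketch(Map.of("mode", "global"));
        UUID session = registry.openSession(Map.of("mode", "custom"));
        registry.executeStatement(session, "show tables");
        System.out.println("open operations: " + registry.operationCount());
        registry.closeSession(session);
        System.out.println("after close: " + registry.operationCount());
    }
}
```

Keeping operations in a server-wide map while each session only remembers its own handles is one way to let CloseSession clean up everything a client left behind.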
New Client API
• TCLIService.thrift
  • Complete API
• Complete Database API
  • Think about JDBC/ODBC
  • To be compatible with existing DB software
  • Hive Specific API
• Best Practice
  • Client API vs. Internal API
  • Converting and Isolation
Session
• OpenSession: The client requests a new session. The server creates a new HiveSession and returns a unique SessionHandle (UUID). All other calls depend on this session.
• CloseSession: The client requests to close the session. This also closes and removes all operations in the session.
SQL and Hive Operation
• ExecuteStatement: Executes an HQL statement.
  • SQLOp: Some SQL statements can be tagged “runAsync”; they are then executed in a dedicated thread and the call returns immediately.
  • Hive Command Operation: SetOp, DfsOp, AddResourceOp, DeleteResourceOp
DB Metadata Operation
• GetInfo *: Gets various global variables of Hive. (Key-Type -> Value)
• GetTypeInfo: Gets the detailed description and constraints of the data types.
• GetCatalogs: Does nothing so far.
• GetSchemas: Gets schemas from the metastore.
• GetTables: Gets table schemas from the metastore.
• GetTableTypes: Gets the table types, e.g. MANAGED_TABLE, EXTERNAL_TABLE, VIRTUAL_VIEW, INDEX_TABLE.
• GetColumns: Gets the columns of a table from the metastore.
• GetFunctions: Gets the UDFs.
Operation for Operation
• GetOperationStatus: Gets the state of an operation by opHandle: INITIALIZED/RUNNING/FINISHED/CANCELED/CLOSED/ERROR/UNKNOWN/PENDING.
• CancelOperation: Cancels a RUNNING or PENDING operation by opHandle. For SQLOp, does cleanup: closes and destroys the Hive Driver, deletes temp output files, and cancels the task running in the background thread…
• CloseOperation: Removes the operation and closes it: for SQLOp, does cleanup; for HiveCommandOp, tearDownSessionIO.
Get Result
• GetResultSetMetadata: Gets the result set’s schema, such as the column titles.
• FetchResults: Fetches the result rows from the real result set.
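The operation calls above (runAsync execution, GetOperationStatus, CancelOperation, FetchResults, CloseOperation) can be illustrated with a small state machine on top of a fixed thread pool. This is a sketch under assumptions: AsyncOpSketch and its method names are invented here, the "statement" is just a Callable, and returning UNKNOWN for a missing handle is a simplification of this example rather than documented server behaviour.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

/** The states listed for GetOperationStatus. */
enum OpState { INITIALIZED, PENDING, RUNNING, FINISHED, CANCELED, CLOSED, ERROR, UNKNOWN }

/** Hypothetical async operation runner; not Hive's OperationManager. */
final class AsyncOpSketch {
    private static final class Op {
        final AtomicReference<OpState> state = new AtomicReference<>(OpState.INITIALIZED);
        volatile Future<?> task;
        volatile List<String> rows = List.of();
    }

    private final Map<UUID, Op> handleToOperation = new ConcurrentHashMap<>();
    private final ExecutorService backgroundPool;

    AsyncOpSketch(int threads) {
        backgroundPool = Executors.newFixedThreadPool(threads);
    }

    /** Queues the work and returns immediately with a handle. */
    UUID executeStatementAsync(Callable<List<String>> work) {
        Op op = new Op();
        UUID handle = UUID.randomUUID();
        handleToOperation.put(handle, op);
        op.state.set(OpState.PENDING);
        op.task = backgroundPool.submit(() -> {
            if (!op.state.compareAndSet(OpState.PENDING, OpState.RUNNING)) {
                return;                               // canceled or closed before it started
            }
            try {
                op.rows = work.call();                // publish rows before flipping the state
                op.state.compareAndSet(OpState.RUNNING, OpState.FINISHED);
            } catch (Exception e) {
                op.state.compareAndSet(OpState.RUNNING, OpState.ERROR);
            }
        });
        return handle;
    }

    OpState getOperationStatus(UUID handle) {
        Op op = handleToOperation.get(handle);
        return op == null ? OpState.UNKNOWN : op.state.get();
    }

    /** Only a PENDING or RUNNING operation can be canceled. */
    boolean cancelOperation(UUID handle) {
        Op op = handleToOperation.get(handle);
        if (op == null) {
            return false;
        }
        boolean canceled = op.state.compareAndSet(OpState.PENDING, OpState.CANCELED)
                || op.state.compareAndSet(OpState.RUNNING, OpState.CANCELED);
        if (canceled) {
            op.task.cancel(true);                     // interrupt the background thread
        }
        return canceled;
    }

    List<String> fetchResults(UUID handle) {
        Op op = handleToOperation.get(handle);
        if (op == null || op.state.get() != OpState.FINISHED) {
            throw new IllegalStateException("No finished operation for handle " + handle);
        }
        return op.rows;
    }

    void closeOperation(UUID handle) {
        Op op = handleToOperation.remove(handle);
        if (op != null) {
            op.task.cancel(true);
            op.state.set(OpState.CLOSED);
        }
    }

    void shutdown() {
        backgroundPool.shutdownNow();
    }
}

final class AsyncOpDemo {
    public static void main(String[] args) throws Exception {
        AsyncOpSketch service = new AsyncOpSketch(2);
        UUID handle = service.executeStatementAsync(() -> List.of("row-1", "row-2"));
        while (service.getOperationStatus(handle) != OpState.FINISHED) {
            Thread.sleep(10);                         // a client would poll GetOperationStatus
        }
        System.out.println(service.fetchResults(handle));
        service.closeOperation(handle);
        service.shutdown();
    }
}
```

The compare-and-set transitions keep a late CancelOperation from overwriting FINISHED, and keep a worker from overwriting CANCELED.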
Code
• Packages
  • org.apache.hive.service …, top project of apache…
• Pros
  • Clear implementation
  • Decoupling of HiveServer2 and HiveCore
  • Decoupling of Thrift Client API and internal code
• Cons
  • Too many design patterns.
  • In some places, inconsistent principles.
  • HiveServer2 and HiveCore are still not completely decoupled.
  • The JDBC Driver package/jar still relies on a lot of other core code, such as Hive->Hadoop and the libs… (maybe because of the support for Embedded Mode.)
Code
This is just a quick view; it may not be exact in every detail, and it intentionally leaves out some less important parts.
[Class diagram]
HiveServer2
  +main(): entry point
CompositeService
  +serviceList
  +addService() +removeService()
AbstractService
  +HiveConf: global, set by init()
Service
  +state
  +init() +start() +stop() +register(): StateChangeListener
CLIService
  +sessionManager
ThriftBinaryCLIService, ThriftCLIService
  +cliService
TCLIService.Iface
  +OpenSession() +CloseSession() +GetInfo() +ExecuteStatement() +...() +FetchResults()
TThreadPoolServer
ICLIService
  +openSession() +closeSession() +getInfo() +executeStatement() +...() +fetchResults()
SessionManager
  +handleToSession: HashMap +operationManager +backgroundOperationPool
  +openSession() +closeSession() +getSession() +...() +submitBackgroundOperation()
FixedThreadPool
OperationManager
  +handleToOperation: HashMap
  +newExecuteStatementOperation() +newGetTypeInfoOperation() +...() +addOperation() +removeOperation() +getOperation() +getOperationState() +cancelOperation() +closeOperation() +getOperationNextRowSet() +...()
HiveSessionImpl, HiveSession
  +sessionHandle +hiveConf: new for each +sessionState: new for each +opHandleSet
  +getSessionHandle() +getInfo() +executeStatement() +executeStatementAsync() +...() +fetchResults()
Operation
  +opHandle +parentSession +state
  +getState() +setState() +run() +getNextRowSet() +close() +cancel() +...()
ExecuteStatementOperation
SQLOperation, AddResourceOperation, DeleteResourceOperation, DfsOperation, SetOperation
GetSchemasOperation, GetInfoOperation, XXXOperation
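The Service / AbstractService / CompositeService part of the class view follows a composite lifecycle: a parent service forwards init(), start() and stop() to the services in its serviceList. The sketch below illustrates that idea only. The state names, the merged ServiceSketch base class, and stopping children in reverse order are assumptions of this example, and the HiveConf argument of init() is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lifecycle states; the names are assumptions of this sketch. */
enum LifecycleState { NOTINITED, INITED, STARTED, STOPPED }

/** Stands in for Service + AbstractService: a name and a guarded state. */
class ServiceSketch {
    private final String name;
    private LifecycleState state = LifecycleState.NOTINITED;

    ServiceSketch(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    LifecycleState getState() {
        return state;
    }

    void init() {
        require(LifecycleState.NOTINITED);
        state = LifecycleState.INITED;
    }

    void start() {
        require(LifecycleState.INITED);
        state = LifecycleState.STARTED;
    }

    void stop() {
        state = LifecycleState.STOPPED;
    }

    private void require(LifecycleState expected) {
        if (state != expected) {
            throw new IllegalStateException(name + " is " + state + ", expected " + expected);
        }
    }
}

/** Stands in for CompositeService: forwards each lifecycle call to its children. */
class CompositeServiceSketch extends ServiceSketch {
    private final List<ServiceSketch> serviceList = new ArrayList<>();

    CompositeServiceSketch(String name) {
        super(name);
    }

    void addService(ServiceSketch service) {
        serviceList.add(service);
    }

    @Override
    void init() {
        for (ServiceSketch service : serviceList) {
            service.init();
        }
        super.init();
    }

    @Override
    void start() {
        for (ServiceSketch service : serviceList) {
            service.start();
        }
        super.start();
    }

    @Override
    void stop() {
        // Stop in reverse order so later services go down before the ones they use.
        for (int i = serviceList.size() - 1; i >= 0; i--) {
            serviceList.get(i).stop();
        }
        super.stop();
    }
}

final class LifecycleDemo {
    public static void main(String[] args) {
        CompositeServiceSketch server = new CompositeServiceSketch("HiveServer2");
        server.addService(new ServiceSketch("CLIService"));
        server.addService(new ServiceSketch("ThriftCLIService"));
        server.init();
        server.start();
        System.out.println(server.getName() + " is " + server.getState());
        server.stop();
    }
}
```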
HiveCore and Dependencies
• HiveConf
  • Global instance
  • Instance for each Session.
    • The client can inject additional key-value style configurations in OpenSession.
    • Set an explicit session name (id) to control the download directory name.
• Hive SessionState
  • Instance for each Session.
• Hive Driver
  • Instance for each SQL Operation.
• Global static variables?
  • ??
  • Env.?
• SetOperation -> SetProcessor
  • set env: variables cannot be set.
  • set system: global, System.getProperties().setProperty(..)
    • Should we forbid system settings? Or allow only an administrator to change them?
  • set hiveconf: per session instance.
  • set hivevar: per session instance.
  • set: per session instance
• AddResourceOperation and DeleteResourceOperation
  • SessionState.add_resource / delete_resource
  • DOWNLOADED_RESOURCES_DIR("hive.downloaded.resources.dir", System.getProperty("java.io.tmpdir") + File.separator + "${hive.session.id}_resources")
• DfsOperation
  • Auth. with HDFS?
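The scoping rules for set listed above can be made concrete with a small dispatcher: env: is rejected, system: goes to the JVM-wide properties, and hiveconf:, hivevar: and unprefixed keys stay inside the session. SetCommandSketch is a hypothetical class written for this illustration, not Hive's SetProcessor, and it only handles the "key=value" form.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-session handler for "set key=value" commands. */
final class SetCommandSketch {
    private static final String ENV = "env:";
    private static final String SYSTEM = "system:";
    private static final String HIVECONF = "hiveconf:";
    private static final String HIVEVAR = "hivevar:";

    private final Map<String, String> hiveConf = new HashMap<>();  // one map per session
    private final Map<String, String> hiveVars = new HashMap<>();  // one map per session

    void set(String assignment) {
        int eq = assignment.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("Expected key=value but got: " + assignment);
        }
        String key = assignment.substring(0, eq).trim();
        String value = assignment.substring(eq + 1).trim();

        if (key.startsWith(ENV)) {
            throw new UnsupportedOperationException("env: variables cannot be set");
        } else if (key.startsWith(SYSTEM)) {
            // Global: visible to every session in the same JVM.
            System.getProperties().setProperty(key.substring(SYSTEM.length()), value);
        } else if (key.startsWith(HIVECONF)) {
            hiveConf.put(key.substring(HIVECONF.length()), value);
        } else if (key.startsWith(HIVEVAR)) {
            hiveVars.put(key.substring(HIVEVAR.length()), value);
        } else {
            hiveConf.put(key, value);   // assumption: a plain key is a session-level setting
        }
    }

    String getHiveConf(String key) {
        return hiveConf.get(key);
    }

    String getHiveVar(String key) {
        return hiveVars.get(key);
    }
}

final class SetCommandDemo {
    public static void main(String[] args) {
        SetCommandSketch sessionA = new SetCommandSketch();
        SetCommandSketch sessionB = new SetCommandSketch();
        sessionA.set("hivevar:day=2013-10-01");
        sessionA.set("system:sketch.demo.flag=on");
        System.out.println("A sees day = " + sessionA.getHiveVar("day"));
        System.out.println("B sees day = " + sessionB.getHiveVar("day"));
        System.out.println("system flag = " + System.getProperty("sketch.demo.flag"));
    }
}
```

The system: branch is where the isolation between sessions ends, which is why the question of forbidding it or restricting it to an administrator comes up.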
Handle (Identifier)
• SessionHandle
• OperationHandle
• Use UUID
Thrift IDL:
  struct THandleIdentifier {
    // 16 byte globally unique identifier
    // This is the public ID of the handle and
    // can be used for reporting.
    1: required binary guid,
    // 16 byte secret generated by the server
    // and used to verify that the handle is not
    // being hijacked by another user.
    2: required binary secret,
  }
For now only the public ID is used, which is OK.
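To make the two 16-byte fields concrete, the sketch below packs a UUID into 16 bytes for guid, generates a random 16-byte secret, and checks both when a handle comes back from a client. HandleIdSketch and its methods are hypothetical names for this illustration. The secret check shows what the IDL comment describes, not what the server currently does, since only the public ID is used for now.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.UUID;

/** Hypothetical handle identifier: a public 16-byte guid plus a 16-byte secret. */
final class HandleIdSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    final byte[] guid;
    final byte[] secret;

    private HandleIdSketch(byte[] guid, byte[] secret) {
        this.guid = guid;
        this.secret = secret;
    }

    static HandleIdSketch create() {
        byte[] secret = new byte[16];
        RANDOM.nextBytes(secret);
        return new HandleIdSketch(toBytes(UUID.randomUUID()), secret);
    }

    /** Packs the two 64-bit halves of a UUID into 16 bytes, most significant half first. */
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    static UUID toUuid(byte[] raw) {
        if (raw.length != 16) {
            throw new IllegalArgumentException("Expected 16 bytes but got " + raw.length);
        }
        ByteBuffer buffer = ByteBuffer.wrap(raw);
        long mostSignificant = buffer.getLong();
        long leastSignificant = buffer.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }

    /** True only if both the public id and the secret match the ones issued. */
    boolean matches(byte[] presentedGuid, byte[] presentedSecret) {
        return Arrays.equals(guid, presentedGuid)
                && MessageDigest.isEqual(secret, presentedSecret);
    }
}

final class HandleIdDemo {
    public static void main(String[] args) {
        HandleIdSketch handle = HandleIdSketch.create();
        System.out.println("public id: " + HandleIdSketch.toUuid(handle.guid));
        System.out.println("accepted: " + handle.matches(handle.guid, handle.secret));
    }
}
```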
Configurations and Run
Config:
• hive.server2.transport.mode = binary | http | https
• hive.server2.thrift.port = 10000
• hive.server2.thrift.bind.host
• hive.server2.thrift.min.worker.threads = 5
• hive.server2.thrift.max.worker.threads = 500
• hive.server2.async.exec.threads = 50
• hive.server2.async.exec.shutdown.timeout = 10 (seconds)
• hive.support.concurrency = true ???
• hive.zookeeper.quorum =
• …
Run:
• Start HiveServer2
  • bin/hiveserver2 &
• Start CLI (use standard JDBC)
  • bin/beeline
  • !connect jdbc:hive2://localhost:10000
  • show tables;
  • select * from tablename limit 10;
Interface and Clients
• RPC (TCLIService.thrift)
  • Binary protocol
  • HTTP/HTTPS protocol (to be researched)
• New JDBC Driver
  • org.apache.hive.jdbc.HiveDriver
  • URL: jdbc:hive2://hostname:10000/dbname… (jdbc:hive2://localhost:10000/default)
  • Implements more API features.
Third-party clients over JDBC:
• CLI
  • Beeline, based on SQLLine
• IDE: SQuirreL SQL Client
• Web Client (e.g. H2 Web, etc.)
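Any of these clients boils down to plain JDBC calls against the new driver. The sketch below uses only java.sql plus the driver class name and URL format given above. The helper names url and firstColumn, the empty user name and password, and the "show tables" query are assumptions for an unsecured local test server. The main method needs the Hive JDBC driver jar on the classpath and a HiveServer2 running on localhost:10000.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper around the standard JDBC API. */
final class HiveJdbcSketch {
    /** Builds a URL of the form jdbc:hive2://hostname:port/dbname. */
    static String url(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    /** Runs a statement and returns the first column of every row as text. */
    static List<String> firstColumn(Connection connection, String sql) throws SQLException {
        List<String> values = new ArrayList<>();
        try (Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(sql)) {
            while (rows.next()) {
                values.add(rows.getString(1));
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // fails if the driver jar is missing
        String jdbcUrl = url("localhost", 10000, "default");
        // Empty credentials are an assumption for a server without authentication.
        try (Connection connection = DriverManager.getConnection(jdbcUrl, "", "")) {
            for (String table : firstColumn(connection, "show tables")) {
                System.out.println(table);
            }
        }
    }
}
```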
Client Tools: CLI
• SQLLine, Beeline
Client Tools: IDE
• SQuirreL SQL Client
Client Tools: Web Client
Think More …
• Thinking of XX as Platform
  • Standard JDBC/ODBC
  • RESTful API over HTTP, Web Service
  • AWS Redshift, SimpleDB …
• Hive as a Service?
  • http://www.qubole.com/
  • Request Cluster, run SQL ad-hoc and regularly, workflow and schedule.
• Language
  • SQL, R, Pig
• Computing of Estimation, Probability …
Thank You!