Sept 17 2013 – THUG – HBase: A Technical Introduction

DESCRIPTION
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).

TRANSCRIPT
Deep Dive content by Hortonworks, Inc. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
HBase Technical Deep Dive Sept 17 2013 – Toronto Hadoop User Group Adam Muise [email protected]
Deep Dive Agenda
• Background – (how did we get here?)
• High-level Architecture – (where are we?)
• Anatomy of a RegionServer – (how does this thing work?)
• Using HBase – (where do we go from here?)
Background
So what is a BigTable anyway?
• BigTable paper from Google, 2006, Chang et al.
  – "Bigtable is a sparse, distributed, persistent multi-dimensional sorted map."
  – http://research.google.com/archive/bigtable.html
• Key Features:
  – Distributed storage across a cluster of machines
  – Random, online read and write data access
  – Schemaless data model ("NoSQL")
  – Self-managed data partitions
Modern Datasets Break Traditional Databases
> 10x more always-connected mobile devices than seen in the PC era.
> Sensor, video and other machine-generated data easily exceeds 100TB / day.
> Traditional databases can't serve modern application needs.
Apache HBase: The Database For Big Data
More data is the key to richer application experiences and deeper insights.
With HBase you can:
✓ Ingest and retain more data, to petabyte scale and beyond.
✓ Store and access huge data volumes with low latency.
✓ Store data of any structure.
✓ Use the entire Hadoop ecosystem to gain deep insight on your data.
HBase At A Glance
[Diagram: CLIENT LAYER over HBASE LAYER over HDFS LAYER, with numbered callouts.]
1. Clients automatically load balanced across the cluster.
2. Scales linearly to handle any load.
3. Data stored in HDFS allows automated failover.
4. Analyze data with any Hadoop tool.
HBase: Real-Time Data on Hadoop
> Read, Write, Process and Query data in real time using Hadoop infrastructure.
HBase: High Availability
> Data safely protected in HDFS. > Failed nodes are automatically recovered. > No single point of failure, no manual intervention.
[Diagram: two HBase nodes replicating data across HDFS.]
HBase: Multi-Datacenter Replication
> Replicate data to 2 or more datacenters. > Load balancing or disaster recovery.
HBase: Seamless Hadoop Integration
> HBase makes deep analytics simple using any Hadoop tool. > Query with Hive, process with Pig, classify with Mahout.
Apache Hadoop in Review
• Apache Hadoop Distributed Filesystem (HDFS)
  – Distributed, fault-tolerant, throughput-optimized data storage
  – Uses a filesystem analogy, not structured tables
  – The Google File System, 2003, Ghemawat et al.
  – http://research.google.com/archive/gfs.html
• Apache Hadoop MapReduce (MR)
  – Distributed, fault-tolerant, batch-oriented data processing
  – Line- or record-oriented processing of the entire dataset
  – "[Application] schema on read"
  – MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
  – http://research.google.com/archive/mapreduce.html
For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases” http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
High-level Architecture
Logical Architecture
• [Big]Tables consist of billions of rows, millions of columns
• Records ordered by rowkey
  – Inserts require a sort; write-side overhead
  – Applications can take advantage of the sort
• Continuous sequences of rows partitioned into Regions
  – Regions partitioned at row boundary, according to size (bytes)
• Regions automatically split when they grow too large
• Regions automatically distributed around the cluster
  – "Hands-free" partition management (mostly)
Logical Architecture: Distributed, persistent partitions of a BigTable

[Diagram: Table A's rowkey space (a through p) is split into four Regions, spread across three Region Servers:]
- Region Server 7: Table A, Region 1; Table A, Region 2; Table G, Region 1070; Table L, Region 25
- Region Server 86: Table A, Region 3; Table C, Region 30; Table F, Region 160; Table F, Region 776
- Region Server 367: Table A, Region 4; Table C, Region 17; Table E, Region 52; Table P, Region 1116

Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
Physical Architecture
• RegionServers collocate with DataNodes
  – Tight MapReduce integration
  – Opportunity for data-local online processing via coprocessors (experimental)
• HBase Master process manages Region assignment
• ZooKeeper is the configuration glue
• Clients communicate directly with RegionServers (data path)
  – Horizontally scales client load
  – Significantly harder for a single ignorant process to DoS the cluster
• For DDL operations, clients communicate with the HBase Master
• No persistent state in Master or ZooKeeper
  – Recover from HDFS snapshot
  – See also: AWS Elastic MapReduce's HBase restore path
Physical Architecture: Distribution and Data Path

[Diagram: HBase clients (Java apps, the HBase Shell, a REST/Thrift Gateway) connect directly to RegionServers, each collocated with an HDFS DataNode; a ZooKeeper ensemble and the HBase Master sit alongside the NameNode.]

Legend:
- An HBase RegionServer is collocated with an HDFS DataNode.
- HBase clients communicate directly with Region Servers for sending and receiving data.
- HMaster manages Region assignment and handles DDL operations.
- Online configuration state is maintained in ZooKeeper.
- HMaster and ZooKeeper are NOT involved in the data path.
Logical Data Model
• Table as a sorted map of maps: {rowkey => {family => {qualifier => {version => value}}}}
  – Think: nested OrderedDictionary (C#), TreeMap (Java)
• Basic data operations: GET, PUT, DELETE
• SCAN over a range of key-values
  – The benefit of the sorted-rowkey business
  – This is how you implement any kind of "complex query"
• GET, SCAN support Filters
  – Push application logic to RegionServers
• INCREMENT, CheckAnd{Put,Delete}
  – Server-side, atomic data operations
  – Require read lock, can be contentious
• No: secondary indices, joins, multi-row transactions
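The map-of-maps model is easy to mock up with the nested TreeMap the slide suggests. A minimal sketch (all names illustrative, not the HBase API):

```java
import java.util.TreeMap;

public class MapOfMaps {
    // rowkey -> family -> qualifier -> version(timestamp) -> value.
    // The innermost map sorts timestamps descending so the newest
    // version comes first, matching HBase's default read behavior.
    static TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>> table = new TreeMap<>();

    public static void put(String row, String fam, String qual, long ts, String val) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(fam, f -> new TreeMap<>())
             .computeIfAbsent(qual, q -> new TreeMap<>(java.util.Collections.reverseOrder()))
             .put(ts, val);
    }

    public static String get(String row, String fam, String qual) {
        // A default GET returns the newest version of the cell.
        return table.get(row).get(fam).get(qual).firstEntry().getValue();
    }

    public static void main(String[] args) {
        put("a", "cf1", "foo", 1368393847L, "world");
        put("a", "cf1", "foo", 1368394925L, "13.6");
        System.out.println(get("a", "cf1", "foo")); // newest version wins
    }
}
```

Because every level is sorted, a SCAN is just an in-order walk over a sub-range of the outer map.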
Logical Data Model: A sparse, multi-dimensional, sorted map

Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.

Example Table A ({rowkey: {column family: {column qualifier: {timestamp: value}}}}):
- Row "a":
  - cf1:"bar" → {1368394583: 7, 1368394261: "hello"}
  - cf1:"foo" → {1368394925: 13.6, 1368394583: 22, 1368393847: "world"}
  - cf2:"1.0001" → {1368387684: "almost the loneliest number"}
  - cf2:"2011-07-04" → {1368396302: "fourth of July"}
- Row "b":
  - cf2:"thumb" → {1368387247: [3.6 kb png data]}
Anatomy of a RegionServer
Storage Machinery

• RegionServers host N Regions, as assigned by the Master
  – In the common case, Region data is local to the RegionServer/DataNode
• Each column family is stored in isolation from the others
  – "Column-family oriented" storage
  – NOT the same as column-oriented storage
• Key-values are managed by an "HStore"
  – A combined view over data on disk + in-memory edits
  – A Region manages one HStore for each column family
• On disk: key-values are stored sorted in "StoreFiles"
  – StoreFiles are composed of an ordered sequence of "Blocks"
  – Each also carries a BloomFilter to minimize Block access
• In memory: the "MemStore" maintains a heap of recent edits
  – Not to be confused with the "BlockCache"
  – This structure is essentially a log-structured merge tree (LSM-tree)*, with the MemStore as C0 and StoreFiles as C1
* http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
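The MemStore/StoreFile split described above can be sketched as a toy LSM structure (illustrative only; real StoreFiles are immutable HFiles on HDFS, and reads merge them rather than probing one at a time):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class LsmSketch {
    // C0: recent edits in a sorted in-memory structure.
    public TreeMap<String, String> memstore = new TreeMap<>();
    // C1: immutable sorted "files", newest last.
    public List<TreeMap<String, String>> storeFiles = new ArrayList<>();

    public void put(String k, String v) { memstore.put(k, v); }

    // A flush freezes the MemStore contents into a new store file.
    public void flush() {
        storeFiles.add(new TreeMap<>(memstore));
        memstore.clear();
    }

    // Reads consult the MemStore first, then store files newest-first.
    public String get(String k) {
        if (memstore.containsKey(k)) return memstore.get(k);
        for (int i = storeFiles.size() - 1; i >= 0; i--)
            if (storeFiles.get(i).containsKey(k)) return storeFiles.get(i).get(k);
        return null;
    }
}
```

The accumulating list of store files is why compaction exists: it merges them back down so reads touch fewer sources.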
Storage Machinery: Implementing the data model

[Diagram: a RegionServer hosts one HLog (WAL), one BlockCache, and multiple HRegions; each HRegion holds one HStore per column family; each HStore holds a MemStore and multiple StoreFiles (HFiles) persisted on HDFS.]

Legend:
- A RegionServer contains a single WAL, single BlockCache, and multiple Regions.
- A Region contains multiple Stores, one for each Column Family.
- A Store consists of multiple StoreFiles and a MemStore.
- A StoreFile corresponds to a single HFile.
- HFiles and the WAL are persisted on HDFS.
Write Path (Storage Machinery, cont.)

• Write summary:
  1. Log edit to the HLog (WAL)
  2. Record in the MemStore
  3. ACK the write
• Data events are recorded to a WAL on HDFS, for durability
  – After failures, edits in the WAL are replayed during recovery
  – WAL appends are immediate, in the critical write path
• Data collects in the "MemStore" until a "flush" writes new HFiles
  – Flush is automatic, based on configuration (size, or staleness interval)
  – Flush clears WAL entries corresponding to MemStore entries
  – Flush is deferred, not in the critical write path
• HFiles are merge-sorted during "Compaction"
  – Small files are compacted into larger files
  – Old records are discarded (major compaction only)
  – Lots of disk and network IO
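The three-step write summary can be sketched as follows (illustrative names, not HBase code; the real WAL is an append-only file on HDFS):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class WritePath {
    public List<String> wal = new ArrayList<>();           // durable log
    public TreeMap<String, String> memstore = new TreeMap<>(); // fast in-memory view

    public boolean write(String key, String value) {
        wal.add(key + "=" + value);   // 1. log the edit to the WAL first
        memstore.put(key, value);     // 2. record it in the MemStore
        return true;                  // 3. ACK the write to the client
    }

    // Crash recovery: rebuild the in-memory state by replaying the WAL
    // in order; later edits to the same key overwrite earlier ones.
    public static TreeMap<String, String> replay(List<String> wal) {
        TreeMap<String, String> recovered = new TreeMap<>();
        for (String edit : wal) {
            String[] kv = edit.split("=", 2);
            recovered.put(kv[0], kv[1]);
        }
        return recovered;
    }
}
```

WAL-before-MemStore ordering is the whole point: an ACKed edit survives a crash even though it has not been flushed to an HFile yet.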
Write Path: Storing a KeyValue

[Diagram: the RegionServer layout from the previous slide, with the four steps below drawn as numbered arrows.]

Legend:
1. A MutateRequest is received by the RegionServer.
2. A WALEdit is appended to the HLog.
3. The new KeyValues are written to the MemStore.
4. The RegionServer acknowledges the edit with a MutateResponse.
Read Path (Storage Machinery, cont.)

• Read summary:
  1. Evaluate the query predicate
  2. Materialize results from Stores
  3. Batch results to the client
• Scanners are opened over all relevant StoreFiles + the MemStore
  – The "BlockCache" maintains recently accessed Blocks in memory
  – BloomFilters are used to skip irrelevant Blocks
  – Predicate matches are accumulated and sorted, returning ordered rows
• The same Scanner APIs are used for GET and SCAN
  – Different access patterns, different optimization strategies
  – SCAN: HDFS is optimized for the throughput of long sequential reads; consider a larger Block size to get more data per seek
  – GET: the BlockCache maintains hot Blocks for point access; consider a more granular BloomFilter
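The scanner merge at the heart of the read path can be sketched as a merge of sorted sources where the newest source wins on duplicate keys (illustrative, not HBase code):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReadPath {
    // Merge several sorted key->value sources into one sorted result.
    // Sources are ordered newest-first (MemStore, then StoreFiles),
    // so the first source holding a key supplies its value.
    public static TreeMap<String, String> merge(List<TreeMap<String, String>> sources) {
        TreeMap<String, String> result = new TreeMap<>();
        for (TreeMap<String, String> src : sources)
            for (Map.Entry<String, String> e : src.entrySet())
                result.putIfAbsent(e.getKey(), e.getValue()); // newest wins
        return result;
    }
}
```

The real implementation streams this merge with a heap of scanners rather than materializing everything, but the newest-source-wins rule is the same.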
Read Path: Serving a single read request

[Diagram: the RegionServer layout from before, with the five steps below drawn as numbered arrows.]

Legend:
1. A GetRequest is received by the RegionServer.
2. StoreScanners are opened over appropriate StoreFiles and the MemStore.
3. Blocks identified as potential matches are read from HDFS if not already in the BlockCache.
4. KeyValues are merged into the final set of Results.
5. A GetResponse containing the Results is returned to the client.
Using HBase
For what kinds of workloads is it well suited?
• It depends on how you tune it, but…
• HBase is good for:
  – Large datasets
  – Sparse datasets
  – Loosely coupled (denormalized) records
  – Lots of concurrent clients
• Try to avoid:
  – Small datasets (unless you have *lots* of them)
  – Highly relational records
  – Schema designs requiring transactions
HBase Use Cases
• Flexible Schema
• Huge Data Volume
• High Read Rate
• High Write Rate
• Machine-Generated Data
• Distributed Messaging
• Real-Time Analytics
• Object Store
• User Profile Management
HBase Example Use Case: Major Hard Drive Manufacturer
• Goal: detect defective drives before they leave the factory.
• Solution:
  – Stream sensor data to HBase as it is generated by their test battery.
  – Perform real-time analysis as data is added, and deep analytics offline.
• HBase is a perfect fit:
  – Scalable enough to accommodate all 250+ TB of data needed.
  – Seamless integration with Hadoop analytics tools.
• Result: – Went from processing only 5% of drive test data to 100%.
Other Example HBase Use Cases
• Facebook messaging and counts
• Time series data
• Exposing Machine Learning models (like risk sets)
• Large message set store-and-forward, especially in social media
• Geospatial indexing
• Indexing the Internet
How does it integrate with my infrastructure?
• Horizontally scale application data
  – Highly concurrent, read/write access
  – Consistent, persisted shared state
  – Distributed online data processing via Coprocessors (experimental)
• Gateway between online services and offline storage/analysis
  – Staging area to receive new data
  – Serve online "views" on datasets in HDFS
  – Glue between batch (HDFS, MR1) and online (CEP, Storm) systems
What data semantics does it provide?
• GET, PUT, DELETE key-value operations
• SCAN for queries
• INCREMENT, CAS server-side atomic operations
• Row-level write atomicity
• MapReduce integration
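The INCREMENT and CAS (CheckAndPut) semantics can be sketched with single-row atomics (an illustrative model, not the HBase client API):

```java
import java.util.concurrent.ConcurrentHashMap;

public class AtomicOps {
    // One "row" of counters/values; the map gives us atomic
    // read-modify-write, standing in for the RegionServer's row lock.
    public ConcurrentHashMap<String, Long> row = new ConcurrentHashMap<>();

    // Like INCREMENT: atomically add and return the new value.
    public long increment(String qualifier, long amount) {
        return row.merge(qualifier, amount, Long::sum);
    }

    // Like CheckAndPut: write newValue only if the current value
    // matches the expectation (null = "cell must not exist yet").
    public boolean checkAndPut(String qualifier, Long expected, long newValue) {
        if (expected == null) return row.putIfAbsent(qualifier, newValue) == null;
        return row.replace(qualifier, expected, newValue);
    }
}
```

Both succeed or fail atomically on the server side, which is why the deck notes they require a lock and can be contentious under heavy concurrent use.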
Creating a table in HBase

#!/bin/sh
# Small script to setup the hbase table used by OpenTSDB.
test -n "$HBASE_HOME" || { #A
  echo >&2 'The environment variable HBASE_HOME must be set'
  exit 1
}
test -d "$HBASE_HOME" || {
  echo >&2 "No such directory: HBASE_HOME=$HBASE_HOME"
  exit 1
}
TSDB_TABLE=${TSDB_TABLE-'tsdb'}
UID_TABLE=${UID_TABLE-'tsdb-uid'}
COMPRESSION=${COMPRESSION-'LZO'}
exec "$HBASE_HOME/bin/hbase" shell <<EOF
create '$UID_TABLE', #B
  {NAME => 'id', COMPRESSION => '$COMPRESSION'}, #B
  {NAME => 'name', COMPRESSION => '$COMPRESSION'} #B
create '$TSDB_TABLE', #C
  {NAME => 't', COMPRESSION => '$COMPRESSION'} #C
EOF
#A From environment, not parameter
#B Make the tsdb-uid table with column families id and name
#C Make the tsdb table with the t column family
# Script taken from HBase in Action - Chapter 7
Coprocessors in a nutshell
• Two types of coprocessors: Observers and Endpoints
• Coprocessors are Java code executed in each RegionServer
• Observer
  – Similar to a database trigger
  – Available Observer types: RegionObserver, WALObserver, MasterObserver
  – Mainly used to extend pre/post logic around RegionServer events, WAL events, or DDL events
• Endpoint
  – Sort of like a UDF
  – Extends the HBase client API to expose functions to a user
  – Still executed on the RegionServer
  – Often used for sums/aggregations (HBase ships with an aggregate example)
• BE VERY CAREFUL WITH COPROCESSORS
  – They run in your RegionServers, and buggy code can take down your cluster
  – See the HOYA details to help mitigate the risk
What about operational concerns?

• Balance memory and IO for reads
  – Contention between random and sequential access
  – Configure Block size and BlockCache based on access patterns
  – Additional resources:
    – "HBase: Performance Tuners," http://labs.ericsson.com/blog/hbase-performance-tuners
    – "Scanning in HBase," http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html
• Balance IO for writes
  – Provision hardware with more spindles/TB
  – Configure L1 (compactions, region size, &c.) based on write pattern
  – Balance contention between maintaining L1 and serving reads
  – Additional resources:
    – "Configuring HBase Memstore: what you should know," http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
    – "Visualizing HBase Flushes And Compactions," http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
Operational Tidbits
• Decommissioning a node outright will result in a downed server; use "graceful_stop.sh" to offload the workload from the RegionServer first
• Use "zk_dump" to find all of your RegionServers and see how your ZooKeeper instances are faring
• Use "status 'summary'" or "status 'detailed'" for a count of live/dead servers, average load, and file counts
• Use "balancer" to automatically balance regions if HBase is set to auto-balance
• When using “hbase hbck” to diagnose and fix issues, RTFM!
SQL and HBase: Hive and Phoenix over HBase
Phoenix over HBase
• Phoenix is a SQL shim over HBase
• https://github.com/forcedotcom/phoenix
• HBase's fast write path means Phoenix can offer fast simple queries (no joins) and fast upserts
• Phoenix implements its own JDBC driver, so you can use your favorite tools
Phoenix over HBase
Hive over HBase
• Hive can be used directly with HBase
• Hive uses the "HBaseStorageHandler" to map HBase tables and query them via MapReduce
• The Storage Handler has hooks for:
  – Getting input/output formats
  – Metadata operations: CREATE TABLE, DROP TABLE, etc.
• The Storage Handler is a table-level concept
  – Does not support Hive partitions or buckets
• The Hive table does not need to include all columns from the HBase table
Hive over HBase
Hive over HBase
Hive and Phoenix over HBase

> hive
add jar /usr/lib/hbase/hbase-0.94.6.1.3.0.0-107-security.jar;
add jar /usr/lib/hbase/lib/zookeeper.jar;
add jar /usr/lib/hbase/lib/protobuf-java-2.4.0a.jar;
add jar /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-107.jar;
set hbase.zookeeper.quorum=node1.hadoop;

CREATE EXTERNAL TABLE phoenix_mobilelograw(
  key string, ip string, ts string, code string,
  d1 string, d2 string, d3 string, d4 string,
  properties string )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,F:IP,F:TS,F:CODE,F:D1,F:D2,F:D3,F:D4,F:PROPERTIES")
TBLPROPERTIES ("hbase.table.name" = "MOBILELOGRAW");

set hive.hbase.wal.enabled=false;
INSERT OVERWRITE TABLE phoenix_mobilelograw SELECT * FROM hive_mobilelograw;
set hive.hbase.wal.enabled=true;
HBase Roadmap
Hortonworks Focus Areas for HBase
Simplified Operations:
• Intelligent Compaction
• Automated Rebalancing
• Ambari Management: Snapshot/Revert, Multimaster HA, Cross-site Replication, Backup/Restore
• Ambari Monitoring: Latency metrics, Throughput metrics, Heatmaps, Region visualizations

Database Functionality:
• First-Class Datatypes
• SQL Interface Support
• Indexes
• Security: Encryption, More Granular Permissions
• Performance: Stripe Compactions, Short Circuit Read for Hadoop 2, Row and Entity Groups, Deeper Hive/Pig Interop
HBase Roadmap Details: Operations
• Snapshots:
  – Protect data or restore to a point in time.
• Intelligent Compaction:
  – Compact when the system is lightly utilized.
  – Avoid "compaction storms" that can break SLAs.
• Ambari Operational Improvements:
  – Configure multi-master HA.
  – Simple setup/configuration for replication.
  – Manage and schedule snapshots.
  – More visualizations, more health checks.
HBase Roadmap Details: Data Management
• Datatypes:
  – First-class datatypes offer performance benefits and better interoperability with tools and other databases.
• SQL Interface (Preview):
  – SQL interface for simplified analysis of data within HBase.
  – JDBC driver allows embedding in existing applications.
• Security:
  – Granular permissions on data within HBase.
HOYA: HBase On YARN
HOYA?
• The new YARN resource negotiation layer in Hadoop allows non-MapReduce applications to run on a Hadoop grid, so why not let HBase take advantage of this capability?
• https://github.com/hortonworks/hoya/
• HOYA is a YARN application that provisions RegionServers based on an HBase cluster configuration
• HOYA helps bring HBase under YARN resource management and paves the way for advanced resource management with HBase
• HOYA can be used to spin up transient HBase clusters during MapReduce or other jobs
A quick YARN refresher…
The 1st Generation of Hadoop: Batch

HADOOP 1.0: Built for Web-Scale Batch Apps

[Diagram: separate single-application silos (BATCH, INTERACTIVE, ONLINE), each with its own HDFS.]

• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads
A Transition From Hadoop 1 to 2

HADOOP 1.0:
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)

HADOOP 2.0:
• HDFS (redundant, reliable storage)
• YARN (cluster resource management)
• MapReduce and others (data processing)
The Enterprise Requirement: Beyond Batch

To become an enterprise-viable data platform, customers have told us they want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service.

[Diagram: BATCH, INTERACTIVE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, ONLINE, and OTHER workloads all running over HDFS (Redundant, Reliable Storage).]
YARN: Taking Hadoop Beyond Batch

• Created to manage resource needs across all uses
• Ensures predictable performance & QoS for all apps
• Enables apps to run "IN" Hadoop rather than "ON"
  – Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc.

Applications Run Natively IN Hadoop

[Diagram: YARN (Cluster Resource Management) over HDFS2 (Redundant, Reliable Storage), hosting BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), and OTHER (Search, Weave, …).]
HOYA Architecture
Key HOYA Design Goals
1. Create on-demand HBase clusters
2. Maintain multiple HBase cluster configurations and implement them as required (e.g. high-load scenarios)
3. Isolation – sandbox clusters running different versions of HBase or with different coprocessors
4. Create transient HBase clusters for MapReduce or other processing
5. Elasticity of clusters for analytics, data ingest, and project-based work
6. Leverage the scheduling in YARN to ensure HBase can be a good Hadoop cluster tenant
Time to call it an evening. We all have important work to do…
Thank you….
For more information, check out HBase: The Definitive Guide or HBase in Action (hbaseinaction.com).