DC HUG April 30, 2013 - HBase and MapR M7 (transcript)
1 ©MapR Technologies
HBase and M7 Technical Overview
Jim Fiori, Senior Solutions Architect, MapR Technologies
April 2013
2 ©MapR Technologies
§ Background § “a 3-hour tour” § Early Hadoop fire-fight § Big Data
Who am I?
3 ©MapR Technologies
Apache HBase MapR M7
Agenda
4 ©MapR Technologies
HBase: Google BigTable Paper - 2006
A sparse, distributed, persistent, indexed, and sorted map, OR
A NoSQL database, OR
A columnar data store
5 ©MapR Technologies
Key-Value Store
§ Row key – Binary sortable value
§ Row content key (analogous to a column) – Column family (string) – Column qualifier (binary) – Version/timestamp (number)
§ A row key, column family, column qualifier, and version together uniquely identify a particular cell – A cell contains a single binary value
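A minimal sketch of these four coordinates in the HBase Java client API (0.94-era, matching this talk's timeframe); the table name, key, and values are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CellAddressSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // hypothetical table
    Put put = new Put(Bytes.toBytes("row1"));     // row key (binary, sortable)
    // family "cf1", qualifier "col1", explicit version 42, binary value
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), 42L, Bytes.toBytes("value1"));
    table.put(put);
    table.close();
  }
}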
6 ©MapR Technologies
A Row
[Diagram: one row, addressed by its row key, holding Value1 … ValueN in columns C0 … CN; each value is individually addressed by (row key, column family, column qualifier, version).]
7 ©MapR Technologies
§ Weakly typed and schema-less (unstructured, or perhaps semi-structured) – Almost everything is binary
§ No constraints – You can put any binary value in any cell – You can even put incompatible types in two different instances of the same column family:column qualifier
§ Columns (qualifiers) are created implicitly
§ Different rows can have different columns § No transactions/no ACID – The only unit of atomic operation is a single row (see the sketch below)
Not a Traditional RDBMS
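Since the single row is the only unit of atomicity, HBase exposes row-level conditional primitives rather than transactions. A hedged sketch (table, family, and values hypothetical) using checkAndPut, which applies the Put only if the named cell currently holds the expected value:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "accounts"); // hypothetical
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("balance"), Bytes.toBytes("90"));
    // Atomic within the single row: write 90 only if balance currently reads 100
    boolean applied = table.checkAndPut(Bytes.toBytes("row1"),
        Bytes.toBytes("cf1"), Bytes.toBytes("balance"),
        Bytes.toBytes("100"), put);
    System.out.println("applied=" + applied);
    table.close();
  }
}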
8 ©MapR Technologies
§ APIs for querying (get), scanning, and updating (put) – Operate on row key, column family, qualifier, version, and values – Can partially specify, and will retrieve the union of results • If you specify just the row key, you get all values for it (with column family and qualifier) – By default only the largest version (the most recent, if a timestamp) is returned
• Specifying a row key and column family in a get retrieves all values for that row and column family
– Scanning is just a get over a range of row keys (see the sketch below)
§ Version – While it defaults to a timestamp, any integer is acceptable
API
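To make the partial-specification behavior concrete, a hedged sketch against the 0.94-era client API (table name and keys hypothetical):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    Get get = new Get(Bytes.toBytes("row1"));  // row key only: all cells of the row
    get.addFamily(Bytes.toBytes("cf1"));       // narrow to one column family
    get.setMaxVersions(3);                     // default returns only the newest version
    Result result = table.get(get);

    // A scan is just a get over a range of row keys [start, stop)
    Scan scan = new Scan(Bytes.toBytes("rowA"), Bytes.toBytes("rowZ"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      System.out.println(Bytes.toString(row.getRow()));
    }
    scanner.close();
    table.close();
  }
}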
9 ©MapR Technologies
§ Rather than storing table rows linearly on disk, with each row a single byte range of fixed-size fields, store the columns of a row separately – Very efficient storage for sparse data sets (NULL is free) – Compression works better on similar data – Fetches of only a subset of a row are very efficient (less disk I/O) – No fixed size on column values – No requirement to even define columns
§ Columns are grouped together into column families – Basically a file on disk – A unit of optimization – In HBase, adding a column is implicit, adding a column family is explicit (see the sketch below)
Columnar
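A hedged sketch of that explicit/implicit split, using the 0.94-era admin API (table and family names hypothetical): families are declared when the table is created, while individual columns appear on first write:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableSketch {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor desc = new HTableDescriptor("mytable");  // hypothetical
    desc.addFamily(new HColumnDescriptor("cf1"));  // explicit: one file set per family
    desc.addFamily(new HColumnDescriptor("cf2"));
    admin.createTable(desc);
    admin.close();
    // No qualifier declarations anywhere: the first Put to cf1:<anything>
    // creates that column implicitly.
  }
}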
10 ©MapR Technologies
HBase Table Architecture § Tables are divided into key ranges (regions) § Regions are served by nodes (RegionServers) § Columns are divided into access groups (column families)
[Diagram: a table laid out as regions R1-R4 (row-key ranges) by column families CF1-CF5.]
11 ©MapR Technologies
HBase Architecture
12 ©MapR Technologies
§ Data is stored in sorted order – A table contains rows – A sequence of rows is grouped together into a region • A region consists of various files related to those rows and is loaded into a region server
• Region files are stored in HDFS for high availability – A single region server manages multiple regions • Region assignment can change – load balancing, failures, etc.
§ Clients connect to tables – The HBase runtime transparently determines the region (based on key ranges) and contacts the appropriate region server
§ At any given time exactly one region server provides access to a given region – The HBase Master (with ZooKeeper) manages that
Storage Model Highlights
13 ©MapR Technologies
§ Very scalable § Easy to add region servers § Easy to move regions around § Scans are efficient – Unlike hashing-based models
§ Access via row key is very efficient – Note: there are no secondary indexes
§ No schema; you can store whatever you want, when you want § Strong consistency
§ Integrated with Hadoop – MapReduce on HBase is straightforward (see the sketch below) – HDFS/MapR-FS provides data replication
What’s Great About This?
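As an illustration of that integration, a hedged sketch of wiring a map-only job over a table with the standard TableMapReduceUtil helper (the table name and the empty MyMapper are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobSketch {
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    // a map(rowKey, row, context) override would process each row here
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Scan scan = new Scan();
    scan.setCaching(500);        // batch rows per RPC for scan throughput
    scan.setCacheBlocks(false);  // full scans shouldn't churn the block cache
    Job job = new Job(conf, "hbase-scan");  // hypothetical job name
    job.setJarByClass(ScanJobSketch.class);
    TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);    // map-only
    job.waitForCompletion(true);
  }
}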
14 ©MapR Technologies
§ Data from a region's column family is stored in an HFile – An HFile contains row key:column qualifier:version:value entries
– Index at the end into the data – 64KB "blocks" by default § Update – The new value is written persistently to the Write-Ahead Log (WAL) – Cached in memory (MemStore) – When memory fills, write out a new HFile
§ Read – Checks memory first, then all of the HFiles – Read data is cached in memory
§ Delete – Creates a tombstone record, purged at major compaction (see the sketch below)
Data Storage Architecture
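A short hedged sketch of the delete path (names hypothetical): the client call below only writes a tombstone; the underlying cells are physically removed when a major compaction next rewrites the HFiles:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    Delete del = new Delete(Bytes.toBytes("row1"));                 // whole row...
    del.deleteColumns(Bytes.toBytes("cf1"), Bytes.toBytes("col1")); // ...or one column
    table.delete(del);  // writes a tombstone; data stays on disk until major compaction
    table.close();
  }
}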
15 ©MapR Technologies
Apache HBase HFile Structure
64KB blocks are compressed
An index into the compressed blocks is created as a B-tree
Key-value pairs are laid out in increasing order
Each cell is an individual key + value - a row repeats the key for each column
16 ©MapR Technologies
HBase Region Operation
§ Typical region size is a few GB, sometimes even 10GB or 20GB § A RegionServer holds data in memory in a MemStore until it is full, then writes a new HFile – The logical view of the database is constructed by layering these files, with the newest on top
[Diagram: the key range represented by this region, as a stack of HFiles from newest (top) to oldest (bottom).]
17 ©MapR Technologies
HBase Read Amplification § When a get/scan comes in, all the files have to be examined – Schema-less, so where is the column? – Done in memory; does not change what's on disk • Bloom filters do not help in scans
With 7 files, a 1K-record get() potentially takes about 30 seeks, plus 7 block fetches and decompressions, from HDFS. Even with the index in memory, 7 seeks and 7 block fetches are required.
18 ©MapR Technologies
HBase Write Amplification
§ To reduce the read amplification, HBase merges the HFiles periodically – a process called compaction – runs automatically when there are too many files – usually turned off, due to I/O storms which interfere with client access
– and kicked off manually on weekends (see the example below)
Major compaction reads all the files and merges them into a single HFile
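The manual kick-off is typically done from the HBase shell, as in this hedged example (table name hypothetical); the time-based automatic trigger is governed by the standard hbase.hregion.majorcompaction property, which is commonly set to 0 to disable it:

$ hbase shell
hbase(main):001:0> major_compact 'mytable'   # merge all HFiles of each region into one per family
hbase(main):002:0> compact 'mytable'         # minor compaction: merge only a few files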
20 ©MapR Technologies
§ A persistent record of every update/insert in sequence order – Shared by all regions on one region server – WAL files are periodically rolled to limit their size, but older WALs are still needed – A WAL file is no longer needed once every region with updates in that WAL file has flushed those updates from memory to an HFile • Remember that more HFiles slow the read path! (see the settings below)
§ Must be replayed as part of the recovery process, since in-memory updates are "lost" – This is very expensive and delays bringing a region back online
WAL File
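Two standard hbase-site.xml settings govern the trade-off described above; the values shown here are illustrative, not recommendations:

<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>32</value>
  <!-- force MemStore flushes once this many WAL files have accumulated -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
  <!-- flush a MemStore to a new HFile once it reaches 128 MB -->
</property>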
21 ©MapR Technologies
What’s Not So Good
Reliability • Complex coordination between ZooKeeper, HDFS, the HBase Master, and RegionServers during region movement
• Compactions disrupt operations • Very slow crash recovery, because of • coordination complexity • WAL log reading (one log per server)
Business continuity • Many administrative actions require downtime • Not well integrated with MapR-FS mirroring and snapshot functionality
22 ©MapR Technologies
What’s Not So Good
Performance • Very long read/write path • Significant read and write amplification • Multiple JVMs in the read/write path – GC delays!
Manageability • Compactions, splits and merges must (in reality) be done manually
• Lots of "well known" problems in maintaining a reliable cluster – splitting, compactions, region assignment, etc.
• Practical limits on the number of regions per region server and on region size – can make it hard to fully utilize hardware
23 ©MapR Technologies
Region Assignment in Apache HBase
24 ©MapR Technologies
Apache HBase on MapR
Limited data management, data protection, and disaster recovery for tables.
25 ©MapR Technologies
HBase MapR M7 Containers
Agenda
27 ©MapR Technologies
MapR Distribution for Apache Hadoop
§ Complete Hadoop distribution
§ Comprehensive management suite
§ Industry-standard interfaces
§ Enterprise-grade dependability
§ Higher performance
28 ©MapR Technologies
MapR: The Enterprise-Grade Distribution
29 ©MapR Technologies
One Platform for Big Data
[Diagram: a single platform serving a broad range of applications – batch, interactive, real-time, and stream processing; MapReduce, file-based applications, SQL, database, and search – with 99.999% HA, data protection, disaster recovery, scalability and performance, enterprise integration, and multi-tenancy. Example applications: recommendation engines, fraud detection, billing, logistics, risk modeling, market segmentation, inventory forecasting.]
32 ©MapR Technologies
The Cloud Leaders Pick MapR
Google chose MapR to provide Hadoop on Google Compute Engine
Amazon EMR is the largest Hadoop provider in revenue and # of clusters
MinuteSort record: 1.5 TB in 60 seconds on 2,103 nodes
34 ©MapR Technologies
MapR Editions
M5: § Control System § NFS Access § Performance § High Availability § Snapshots & Mirroring § 24x7 Support § Annual Subscription
M3: § Control System § NFS Access § Performance § Unlimited Nodes § Free
M7: § All the features of M5 § Simplified administration for HBase § Increased performance § Consistent low latency § Unified snapshots, mirroring
Also available through: Google Compute Engine
35 ©MapR Technologies
HBase MapR M7
Agenda
37 ©MapR Technologies
Introducing MapR M7
§ An integrated system – Unified namespace for files and tables – Built-in data management & protection – No extra administration
§ Architected for reliability and performance – Fewer layers – Single hop to data – No compactions, low I/O amplification – Seamless splits, automatic merges – Instant recovery
38 ©MapR Technologies
M7: Remove Layers, Simplify
MapR M7
Take note! No JVM!
39 ©MapR Technologies
Binary Compatible with HBase APIs
§ HBase applications work "as is" with M7 – No need to recompile (binary compatible)
§ Can run M7 and HBase side by side on the same cluster – e.g., during a migration – can access both an M7 table and an HBase table in the same program (see the sketch below)
§ Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa:
% hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable
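A hedged sketch of the side-by-side case (paths and names hypothetical): on a MapR cluster, M7 tables are addressed by filesystem path, so a single program can hold handles to both kinds of table through the same client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class SideBySideSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable hbaseTable = new HTable(conf, "oldtable");           // Apache HBase table, by name
    HTable m7Table = new HTable(conf, "/user/srivas/mytable");  // M7 table, by path
    // ... read from one, write to the other, e.g. during a migration ...
    hbaseTable.close();
    m7Table.close();
  }
}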
40 ©MapR Technologies
M7: No Master and No RegionServers
No extra daemons to manage
One hop to data
Unified cache
No JVM problems
41 ©MapR Technologies
Region Assignment in Apache HBase – none of this complexity is present in MapR M7
42 ©MapR Technologies
Unified Namespace for Files and Tables
$ pwd
/mapr/default/user/dave
$ ls
file1 file2 table1 table2
$ hbase shell
hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds
$ ls
file1 file2 table1 table2 table3
$ hadoop fs -ls /user/dave
Found 5 items
-rw-r--r--   3 mapr mapr 16 2012-09-28 08:34 /user/dave/file1
-rw-r--r--   3 mapr mapr 22 2012-09-28 08:34 /user/dave/file2
trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:32 /user/dave/table1
trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:33 /user/dave/table2
trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:38 /user/dave/table3
43 ©MapR Technologies
Tables for End Users
§ Users can create and manage their own tables – Unlimited number of tables
§ Tables can be created in any directory – Tables count towards volume and user quotas
§ No admin intervention needed – I can create a file or a directory without opening a ticket with the admin team; why not a table?
– Do stuff on the fly; no stopping/restarting servers (see the example below)
§ Automatic data protection and disaster recovery – Users can recover from snapshots/mirrors on their own
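A hedged example of the self-service flow using the MapR CLI (the path is hypothetical; the same table could equally be created from the hbase shell, as shown earlier):

$ maprcli table create -path /user/dave/mytable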
44 ©MapR Technologies
M7 – An Integrated System
45 ©MapR Technologies
M7 Comparative Analysis with Apache HBase, LevelDB, and a B-tree
46 ©MapR Technologies
HBase Write Amplification Analysis
§ Assume 10GB per region, write 10% per day, grow 10% per week – 1GB of writes – after 7 days: 7 files of 1GB and 1 file of 10GB (only 1GB is growth)
§ I/O cost – Wrote 7GB to the WAL + 7GB to HFiles – Compaction adds still more • read: 17GB (= 7 x 1GB + 1 x 10GB) • write: 11GB to the new HFile
– Write amplification: wrote 7GB "for real," but the actual disk I/O after compaction is 17GB read + 25GB written – and that's assuming no application reads!
§ The I/O cost of 1,000 such regions is similar – read 17TB, write 25TB → a major impact on the node
§ Best practice: limit the number of regions per node → can't fully utilize storage
47 ©MapR Technologies
Alternative: LevelDB
§ Tiered, logarithmic increase – L1: 2 x 1MB files – L2: 10 x 1MB – L3: 100 x 1MB – L4: 1,000 x 1MB, etc.
§ Compaction overhead – avoids I/O storms (I/O is done in smaller increments of ~10MB) – but significantly more bandwidth compared to HBase
§ Read overhead is still high – 10-15 seeks, perhaps more if the lowest level is very large – 40KB-60KB read from disk to retrieve a 1KB record
48 ©MapR Technologies
B-tree analysis § Reads find data directly, proven to be fastest – interior nodes hold only keys – very large branching factor – values only at the leaves – thus index caches work – R = log N seeks with no caching – a 1KB record read will transfer about log N blocks from disk
§ Writes are slow on inserts – data is inserted into the correct place right away – otherwise a read would not find it – requires the B-tree to be continuously rebalanced – causes extreme random I/O in the insert path – W = 2.5x + log N seeks with no caching
49 ©MapR Technologies
Log-Structured Merge Trees § LSM trees reduce insert cost by deferring and batching index changes – If you don't compact often, read performance suffers – If you compact too often, write performance suffers
§ B-trees are great for reads – but expensive to update in real time
[Diagram: writes go to an in-memory index plus an on-disk log; reads consult the in-memory index and the on-disk index.]
Can we combine both ideas? Writes cannot be done better than W = 2.5x:
write to log + write data somewhere + update metadata
50 ©MapR Technologies
M7 from MapR § Twisting B-trees – leaves are variable-size (8KB-8MB or larger) – can stay unbalanced for long periods of time • more inserts will balance it eventually • automatically throttles updates to interior B-tree nodes
– M7 inserts "close to" where the data is supposed to go
§ Reads – Use the B-tree structure to get "close" very fast • very high branching with key-prefix compression
– Utilize a separate lower-level index to find it exactly • updated "in-place" bloom filters for gets, range maps for scans
§ Overhead – a 1KB record read will transfer about 32KB from disk in log N seeks
51 ©MapR Technologies
M7 Provides Instant Recovery § Instead of having one WAL per region server, or even one per region, we have many micro-WALs per region
§ 0-40 micro-WALs per region – idle WALs are "compacted," so most are empty – the region is up before all micro-WALs are recovered – recovers the region in the background, in parallel – when a key is accessed, that micro-WAL is recovered inline – 1,000-10,000x faster recovery
§ Never performs the equivalent of an HBase major or minor compaction
§ Why doesn't HBase do this? M7 uses MapR-FS, not HDFS – No limit on the number of files on disk – No limit on the number of open files – The I/O path translates random writes into sequential writes on disk
53 ©MapR Technologies
M7: Fileservers Serve Regions
§ A region lives entirely inside a container – Does not coordinate through ZooKeeper
§ Containers support distributed transactions – with replication built in
§ The only coordination in the system is for splits – Between the region-map and the data container – this problem was already solved for files and their chunks
57 ©MapR Technologies
M7 Containers
§ A container holds many files – regular, dir, symlink, btree, chunk-map, region-map, … – all random-write capable
§ A container is replicated to servers – the unit of resynchronization
§ A region lives entirely inside one container – all files + WALs + B-trees + bloom filters + range maps
63 ©MapR Technologies
Other M7 Features
§ Smaller disk footprint – M7 never repeats the key or column name
§ Columnar layout – M7 supports 64 column families – in-memory column families
§ Online admin – M7 schema changes on the fly – delete/rename/redistribute tables
§ Run MapReduce and tables on the same cluster § UI: hbase shell, MCS GUI, maprcli
64 ©MapR Technologies
Thank you!
Questions?