DC HUG April 30, 2013 - HBase and MapR M7 (transcript)
1 ©MapR Technologies
HBase and M7 Technical Overview
Jim Fiori, Senior Solutions Architect, MapR Technologies
April 2013
2 ©MapR Technologies
§ Background § “a 3-hour tour” § Early Hadoop fire-fight § Big Data
Who am I?
3 ©MapR Technologies
Apache HBase MapR M7
Agenda
4 ©MapR Technologies
HBase: Google BigTable Paper - 2006
A sparse, distributed, persistent, indexed, and sorted map, OR
A NoSQL database, OR
A columnar data store
5 ©MapR Technologies
Key-Value Store
§ Row key – Binary sortable value
§ Row content key (analogous to a column) – Column family (string) – Column qualifier (binary) – Version/timestamp (number)
§ A row key, column family, column qualifier, and version together uniquely identify a particular cell – A cell contains a single binary value
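A minimal sketch of these four coordinates in the HBase Java client API (0.94-era, matching this talk's timeframe); the table name, key, and values are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CellAddressSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // hypothetical table
    Put put = new Put(Bytes.toBytes("row1"));     // row key (binary, sortable)
    // family "cf1", qualifier "col1", explicit version 42, binary value
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), 42L, Bytes.toBytes("value1"));
    table.put(put);
    table.close();
  }
}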
6 ©MapR Technologies
A Row
[Diagram: one row, addressed by its row key, holding Value1 … ValueN in columns C0 … CN; each value is individually addressed by (row key, column family, column qualifier, version).]
7 ©MapR Technologies
§ Weakly typed and schema-less (unstructured, or perhaps semi-structured) – Almost everything is binary
§ No constraints – You can put any binary value in any cell – You can even put incompatible types in two different instances of the same column family:column qualifier
§ Columns (qualifiers) are created implicitly
§ Different rows can have different columns § No transactions/no ACID – The only unit of atomic operation is a single row (see the sketch below)
Not a Traditional RDBMS
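Since the single row is the only unit of atomicity, HBase exposes row-level conditional primitives rather than transactions. A hedged sketch (table, family, and values hypothetical) using checkAndPut, which applies the Put only if the named cell currently holds the expected value:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "accounts"); // hypothetical
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("balance"), Bytes.toBytes("90"));
    // Atomic within the single row: write 90 only if balance currently reads 100
    boolean applied = table.checkAndPut(Bytes.toBytes("row1"),
        Bytes.toBytes("cf1"), Bytes.toBytes("balance"),
        Bytes.toBytes("100"), put);
    System.out.println("applied=" + applied);
    table.close();
  }
}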
8 ©MapR Technologies
§ APIs for querying (get), scanning, and updating (put) – Operate on row key, column family, qualifier, version, and values – Can partially specify, and will retrieve the union of results • If you specify just the row key, you get all values for it (with column family and qualifier) – By default only the largest version (the most recent, if a timestamp) is returned
• Specifying a row key and column family in a get retrieves all values for that row and column family
– Scanning is just a get over a range of row keys (see the sketch below)
§ Version – While it defaults to a timestamp, any integer is acceptable
API
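To make the partial-specification behavior concrete, a hedged sketch against the 0.94-era client API (table name and keys hypothetical):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    Get get = new Get(Bytes.toBytes("row1"));  // row key only: all cells of the row
    get.addFamily(Bytes.toBytes("cf1"));       // narrow to one column family
    get.setMaxVersions(3);                     // default returns only the newest version
    Result result = table.get(get);

    // A scan is just a get over a range of row keys [start, stop)
    Scan scan = new Scan(Bytes.toBytes("rowA"), Bytes.toBytes("rowZ"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      System.out.println(Bytes.toString(row.getRow()));
    }
    scanner.close();
    table.close();
  }
}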
9 ©MapR Technologies
§ Rather than storing table rows linearly on disk, with each row a single byte range of fixed-size fields, store the columns of a row separately – Very efficient storage for sparse data sets (NULL is free) – Compression works better on similar data – Fetches of only a subset of a row are very efficient (less disk I/O) – No fixed size on column values – No requirement to even define columns
§ Columns are grouped together into column families – Basically a file on disk – A unit of optimization – In HBase, adding a column is implicit, adding a column family is explicit (see the sketch below)
Columnar
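A hedged sketch of that explicit/implicit split, using the 0.94-era admin API (table and family names hypothetical): families are declared when the table is created, while individual columns appear on first write:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableSketch {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor desc = new HTableDescriptor("mytable");  // hypothetical
    desc.addFamily(new HColumnDescriptor("cf1"));  // explicit: one file set per family
    desc.addFamily(new HColumnDescriptor("cf2"));
    admin.createTable(desc);
    admin.close();
    // No qualifier declarations anywhere: the first Put to cf1:<anything>
    // creates that column implicitly.
  }
}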
10 ©MapR Technologies
HBase Table Architecture § Tables are divided into key ranges (regions) § Regions are served by nodes (RegionServers) § Columns are divided into access groups (column families)
[Diagram: a table laid out as regions R1-R4 (row-key ranges) by column families CF1-CF5.]
11 ©MapR Technologies
HBase Architecture
12 ©MapR Technologies
§ Data is stored in sorted order – A table contains rows – A sequence of rows is grouped together into a region • A region consists of various files related to those rows and is loaded into a region server
• Region files are stored in HDFS for high availability – A single region server manages multiple regions • Region assignment can change – load balancing, failures, etc.
§ Clients connect to tables – The HBase runtime transparently determines the region (based on key ranges) and contacts the appropriate region server
§ At any given time exactly one region server provides access to a given region – The HBase Master (with ZooKeeper) manages that
Storage Model Highlights
13 ©MapR Technologies
§ Very scalable § Easy to add region servers § Easy to move regions around § Scans are efficient – Unlike hashing-based models
§ Access via row key is very efficient – Note: there are no secondary indexes
§ No schema; you can store whatever you want, when you want § Strong consistency
§ Integrated with Hadoop – MapReduce on HBase is straightforward (see the sketch below) – HDFS/MapR-FS provides data replication
What’s Great About This?
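As an illustration of that integration, a hedged sketch of wiring a map-only job over a table with the standard TableMapReduceUtil helper (the table name and the empty MyMapper are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobSketch {
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    // a map(rowKey, row, context) override would process each row here
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Scan scan = new Scan();
    scan.setCaching(500);        // batch rows per RPC for scan throughput
    scan.setCacheBlocks(false);  // full scans shouldn't churn the block cache
    Job job = new Job(conf, "hbase-scan");  // hypothetical job name
    job.setJarByClass(ScanJobSketch.class);
    TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);    // map-only
    job.waitForCompletion(true);
  }
}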
14 ©MapR Technologies
§ Data from a region's column family is stored in an HFile – An HFile contains row key:column qualifier:version:value entries
– Index at the end into the data – 64KB "blocks" by default § Update – The new value is written persistently to the Write-Ahead Log (WAL) – Cached in memory (MemStore) – When memory fills, write out a new HFile
§ Read – Checks memory first, then all of the HFiles – Read data is cached in memory
§ Delete – Creates a tombstone record, purged at major compaction (see the sketch below)
Data Storage Architecture
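A short hedged sketch of the delete path (names hypothetical): the client call below only writes a tombstone; the underlying cells are physically removed when a major compaction next rewrites the HFiles:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    Delete del = new Delete(Bytes.toBytes("row1"));                 // whole row...
    del.deleteColumns(Bytes.toBytes("cf1"), Bytes.toBytes("col1")); // ...or one column
    table.delete(del);  // writes a tombstone; data stays on disk until major compaction
    table.close();
  }
}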
15 ©MapR Technologies
Apache HBase HFile Structure
64KB blocks are compressed
An index into the compressed blocks is created as a B-tree
Key-value pairs are laid out in increasing order
Each cell is an individual key + value - a row repeats the key for each column
16 ©MapR Technologies
HBase Region Operation
§ Typical region size is a few GB, sometimes even 10GB or 20GB § A RegionServer holds data in memory in a MemStore until it is full, then writes a new HFile – The logical view of the database is constructed by layering these files, with the newest on top
[Diagram: the key range represented by this region, as a stack of HFiles from newest (top) to oldest (bottom).]
17 ©MapR Technologies
HBase Read Amplification § When a get/scan comes in, all the files have to be examined – Schema-less, so where is the column? – Done in memory; does not change what's on disk • Bloom filters do not help in scans
With 7 files, a 1K-record get() potentially takes about 30 seeks, plus 7 block fetches and decompressions, from HDFS. Even with the index in memory, 7 seeks and 7 block fetches are required.
18 ©MapR Technologies
HBase Write Amplification
§ To reduce the read amplification, HBase merges the HFiles periodically – a process called compaction – runs automatically when there are too many files – usually turned off, due to I/O storms which interfere with client access
– and kicked off manually on weekends (see the example below)
Major compaction reads all the files and merges them into a single HFile
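The manual kick-off is typically done from the HBase shell, as in this hedged example (table name hypothetical); the time-based automatic trigger is governed by the standard hbase.hregion.majorcompaction property, which is commonly set to 0 to disable it:

$ hbase shell
hbase(main):001:0> major_compact 'mytable'   # merge all HFiles of each region into one per family
hbase(main):002:0> compact 'mytable'         # minor compaction: merge only a few files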
20 ©MapR Technologies
§ A persistent record of every update/insert in sequence order – Shared by all regions on one region server – WAL files are periodically rolled to limit their size, but older WALs are still needed – A WAL file is no longer needed once every region with updates in that WAL file has flushed those updates from memory to an HFile • Remember that more HFiles slow the read path! (see the settings below)
§ Must be replayed as part of the recovery process, since in-memory updates are "lost" – This is very expensive and delays bringing a region back online
WAL File
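Two standard hbase-site.xml settings govern the trade-off described above; the values shown here are illustrative, not recommendations:

<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>32</value>
  <!-- force MemStore flushes once this many WAL files have accumulated -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
  <!-- flush a MemStore to a new HFile once it reaches 128 MB -->
</property>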
21 ©MapR Technologies
What’s Not So Good
Reliability • Complex coordination between ZooKeeper, HDFS, the HBase Master, and RegionServers during region movement
• Compactions disrupt operations • Very slow crash recovery, because of • coordination complexity • WAL log reading (one log per server)
Business continuity • Many administrative actions require downtime • Not well integrated with MapR-FS mirroring and snapshot functionality
22 ©MapR Technologies
What’s Not So Good
Performance • Very long read/write path • Significant read and write amplification • Multiple JVMs in the read/write path – GC delays!
Manageability • Compactions, splits and merges must (in reality) be done manually
• Lots of "well known" problems in maintaining a reliable cluster – splitting, compactions, region assignment, etc.
• Practical limits on the number of regions per region server and on region size – can make it hard to fully utilize hardware
23 ©MapR Technologies
Region Assignment in Apache HBase
24 ©MapR Technologies
Apache HBase on MapR
Limited data management, data protection, and disaster recovery for tables.
25 ©MapR Technologies
HBase MapR M7 Containers
Agenda
27 ©MapR Technologies
MapR Distribution for Apache Hadoop
§ Complete Hadoop distribution
§ Comprehensive management suite
§ Industry-standard interfaces
§ Enterprise-grade dependability
§ Higher performance
28 ©MapR Technologies
MapR: The Enterprise-Grade Distribution
29 ©MapR Technologies
One Platform for Big Data
[Diagram: a single platform serving a broad range of applications – batch, interactive, real-time, and stream processing; MapReduce, file-based applications, SQL, database, and search – with 99.999% HA, data protection, disaster recovery, scalability and performance, enterprise integration, and multi-tenancy. Example applications: recommendation engines, fraud detection, billing, logistics, risk modeling, market segmentation, inventory forecasting.]
32 ©MapR Technologies
The Cloud Leaders Pick MapR
Google chose MapR to provide Hadoop on Google Compute Engine
Amazon EMR is the largest Hadoop provider in revenue and # of clusters
MinuteSort record: 1.5 TB in 60 seconds on 2,103 nodes
34 ©MapR Technologies
MapR Editions
M5: § Control System § NFS Access § Performance § High Availability § Snapshots & Mirroring § 24x7 Support § Annual Subscription
M3: § Control System § NFS Access § Performance § Unlimited Nodes § Free
M7: § All the features of M5 § Simplified administration for HBase § Increased performance § Consistent low latency § Unified snapshots, mirroring
Also available through: Google Compute Engine
35 ©MapR Technologies
HBase MapR M7
Agenda
37 ©MapR Technologies
Introducing MapR M7
§ An integrated system – Unified namespace for files and tables – Built-in data management & protection – No extra administration
§ Architected for reliability and performance – Fewer layers – Single hop to data – No compactions, low I/O amplification – Seamless splits, automatic merges – Instant recovery
38 ©MapR Technologies
M7: Remove Layers, Simplify
MapR M7
Take note! No JVM!
39 ©MapR Technologies
Binary Compatible with HBase APIs
§ HBase applications work "as is" with M7 – No need to recompile (binary compatible)
§ Can run M7 and HBase side by side on the same cluster – e.g., during a migration – can access both an M7 table and an HBase table in the same program (see the sketch below)
§ Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa:
% hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable
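A hedged sketch of the side-by-side case (paths and names hypothetical): on a MapR cluster, M7 tables are addressed by filesystem path, so a single program can hold handles to both kinds of table through the same client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class SideBySideSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable hbaseTable = new HTable(conf, "oldtable");           // Apache HBase table, by name
    HTable m7Table = new HTable(conf, "/user/srivas/mytable");  // M7 table, by path
    // ... read from one, write to the other, e.g. during a migration ...
    hbaseTable.close();
    m7Table.close();
  }
}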
40 ©MapR Technologies
M7: No Master and No RegionServers
No extra daemons to manage
One hop to data
Unified cache
No JVM problems
41 ©MapR Technologies
Region Assignment in Apache HBase – none of this complexity is present in MapR M7
42 ©MapR Technologies
Unified Namespace for Files and Tables
$ pwd
/mapr/default/user/dave
$ ls
file1 file2 table1 table2
$ hbase shell
hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds
$ ls
file1 file2 table1 table2 table3
$ hadoop fs -ls /user/dave
Found 5 items
-rw-r--r--   3 mapr mapr 16 2012-09-28 08:34 /user/dave/file1
-rw-r--r--   3 mapr mapr 22 2012-09-28 08:34 /user/dave/file2
trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:32 /user/dave/table1
trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:33 /user/dave/table2
trwxr-xr-x   3 mapr mapr  2 2012-09-28 08:38 /user/dave/table3
43 ©MapR Technologies
Tables for End Users
§ Users can create and manage their own tables – Unlimited number of tables
§ Tables can be created in any directory – Tables count towards volume and user quotas
§ No admin intervention needed – I can create a file or a directory without opening a ticket with the admin team; why not a table?
– Do stuff on the fly; no stopping/restarting servers (see the example below)
§ Automatic data protection and disaster recovery – Users can recover from snapshots/mirrors on their own
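A hedged example of the self-service flow using the MapR CLI (the path is hypothetical; the same table could equally be created from the hbase shell, as shown earlier):

$ maprcli table create -path /user/dave/mytable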
44 ©MapR Technologies
M7 – An Integrated System
45 ©MapR Technologies
M7 Comparative Analysis with Apache HBase, LevelDB, and a B-tree
46 ©MapR Technologies
HBase Write Amplification Analysis
§ Assume 10GB per region, write 10% per day, grow 10% per week – 1GB of writes – after 7 days: 7 files of 1GB and 1 file of 10GB (only 1GB is growth)
§ I/O cost – Wrote 7GB to the WAL + 7GB to HFiles – Compaction adds still more • read: 17GB (= 7 x 1GB + 1 x 10GB) • write: 11GB to the new HFile
– Write amplification: wrote 7GB "for real," but the actual disk I/O after compaction is 17GB read + 25GB written – and that's assuming no application reads!
§ The I/O cost of 1,000 such regions is similar – read 17TB, write 25TB → a major impact on the node
§ Best practice: limit the number of regions per node → can't fully utilize storage
47 ©MapR Technologies
Alternative: LevelDB
§ Tiered, logarithmic increase – L1: 2 x 1MB files – L2: 10 x 1MB – L3: 100 x 1MB – L4: 1,000 x 1MB, etc.
§ Compaction overhead – avoids I/O storms (I/O is done in smaller increments of ~10MB) – but significantly more bandwidth compared to HBase
§ Read overhead is still high – 10-15 seeks, perhaps more if the lowest level is very large – 40KB-60KB read from disk to retrieve a 1KB record
48 ©MapR Technologies
B-tree analysis § Reads find data directly, proven to be fastest – interior nodes hold only keys – very large branching factor – values only at the leaves – thus index caches work – R = log N seeks with no caching – a 1KB record read will transfer about log N blocks from disk
§ Writes are slow on inserts – data is inserted into the correct place right away – otherwise a read would not find it – requires the B-tree to be continuously rebalanced – causes extreme random I/O in the insert path – W = 2.5x + log N seeks with no caching
49 ©MapR Technologies
Log-Structured Merge Trees § LSM trees reduce insert cost by deferring and batching index changes – If you don't compact often, read performance suffers – If you compact too often, write performance suffers
§ B-trees are great for reads – but expensive to update in real time
[Diagram: writes go to an in-memory index plus an on-disk log; reads consult the in-memory index and the on-disk index.]
Can we combine both ideas? Writes cannot be done better than W = 2.5x:
write to log + write data somewhere + update metadata
50 ©MapR Technologies
M7 from MapR § Twisting B-trees – leaves are variable-size (8KB-8MB or larger) – can stay unbalanced for long periods of time • more inserts will balance it eventually • automatically throttles updates to interior B-tree nodes
– M7 inserts "close to" where the data is supposed to go
§ Reads – Use the B-tree structure to get "close" very fast • very high branching with key-prefix compression
– Utilize a separate lower-level index to find it exactly • updated "in-place" bloom filters for gets, range maps for scans
§ Overhead – a 1KB record read will transfer about 32KB from disk in log N seeks
51 ©MapR Technologies
M7 Provides Instant Recovery § Instead of having one WAL per region server, or even one per region, we have many micro-WALs per region
§ 0-40 micro-WALs per region – idle WALs are "compacted," so most are empty – the region is up before all micro-WALs are recovered – recovers the region in the background, in parallel – when a key is accessed, that micro-WAL is recovered inline – 1,000-10,000x faster recovery
§ Never performs the equivalent of an HBase major or minor compaction
§ Why doesn't HBase do this? M7 uses MapR-FS, not HDFS – No limit on the number of files on disk – No limit on the number of open files – The I/O path translates random writes into sequential writes on disk
53 ©MapR Technologies
M7: Fileservers Serve Regions
§ A region lives entirely inside a container – Does not coordinate through ZooKeeper
§ Containers support distributed transactions – with replication built in
§ The only coordination in the system is for splits – Between the region-map and the data container – this problem was already solved for files and their chunks
57 ©MapR Technologies
M7 Containers
§ A container holds many files – regular, dir, symlink, btree, chunk-map, region-map, … – all random-write capable
§ A container is replicated to servers – the unit of resynchronization
§ A region lives entirely inside one container – all files + WALs + B-trees + bloom filters + range maps
63 ©MapR Technologies
Other M7 Features
§ Smaller disk footprint – M7 never repeats the key or column name
§ Columnar layout – M7 supports 64 column families – in-memory column families
§ Online admin – M7 schema changes on the fly – delete/rename/redistribute tables
§ Run MapReduce and tables on the same cluster § UI: hbase shell, MCS GUI, maprcli
64 ©MapR Technologies
Thank you!
Questions?