Breaking with Relational DBMS and Dating with HBase

1

Breaking with DBMS and Dating with HBase

Gaurav Kohli, Xebia

2

Me

Gaurav Kohli

Consultant, Xebia IT Architects

3

Why are we here?

Something about RDBMS

Limitations of RDBMS

Why HBase or any NoSQL solution

Overview of HBase

Specific use cases

Paradigm shift in schema design

Architecture of HBase

HBase interface – Java API, Thrift

Conclusion

4

Databases

5

Relational databases have a lot of limitations

6

Data sets are growing into petabytes

RDBMSs don't scale inherently: scale up / scale out (load balancing + replication)

Hard to shard / partition

High read and write throughput at the same time is not possible: transactional / analytical databases

Specialized hardware is very expensive: Oracle clustering

7

[Diagram: replication from a master to a slave]

8

MySQL master becomes a problem

All slaves must have the same write capacity as the master

Single point of failure, no easy failover

[Diagram: master, slave nodes, reads, writes]

9

[Diagram: master-master replication with a slave]

12

2006.11 Google releases paper on BigTable

2007.2 Initial HBase prototype created as Hadoop contrib

2007.10 First usable HBase

2008.1 Hadoop becomes Apache top-level project and HBase becomes a subproject

2010.5 HBase becomes Apache top-level project

2010.6 HBase 0.20.5 released

2010.10 HBase 0.89.2010092 – third developer release

13

Distributed: uses HDFS for storage

Column-oriented

Multi-dimensional: versions

High-availability

High-performance

Storage system

14

HBase is not a SQL database: no joins, no query engine, no datatypes, no SQL

No schema: denormalized data in a wide, sparsely populated data structure (key-value)

No DBA needed

15

Bigness: big data, big number of users, big number of computers

Massive write performance

Facebook handles 135 billion messages a month

Twitter stores 7 TB of data per day

Fast key-value access

Write availability

No single point of failure

16

Specific use cases

Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, etc.

Real-time inserts, updates, and queries

Fraud detection by comparing transactions to known patterns in real time

Analytics: use MapReduce, Hive, or Pig to perform analytical queries

17

Column-oriented database

Tables are sorted by row key

The table schema only defines column families

A column family can have any number of columns

Each cell value has a timestamp

20

SortedMap(
    RowKey, List(
        SortedMap(
            Column, List(Value, Timestamp)
        )
    )
)

21

A BIG SORTED MAP

Row Key + Column Key + Timestamp => Value

Student table

Row Key   Column Key   Timestamp       Value
1         info:name    1273516197868   Gaurav
1         info:age     1273871824184   28
1         info:age     1273871823022   34
1         info:sex     1273746281432   Male
2         info:name    1273863723227   Harsh
3         info:name    1273822456433   Raman

In a column key such as info:name, "info" is the column family and "name" is the column qualifier/name.

The timestamp is a long value.

The map is sorted by row key and column key.

Row 1 holds 2 versions of info:age.
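The "big sorted map" idea can be made concrete with a small sketch. The following is a toy model in plain Python that reuses the rows of the Student table; it is not the HBase client API, and the helper names (`sort_key`, `get_versions`, `get_latest`) are made up for illustration. It assumes that versions of a cell are ordered newest first.

```python
# Toy model of the logical layout: (row key, column key, timestamp) -> value.
# This is an illustration only, not the HBase client API.
cells = {
    ("1", "info:name", 1273516197868): "Gaurav",
    ("1", "info:age", 1273871824184): "28",
    ("1", "info:age", 1273871823022): "34",
    ("1", "info:sex", 1273746281432): "Male",
    ("2", "info:name", 1273863723227): "Harsh",
    ("3", "info:name", 1273822456433): "Raman",
}


def sort_key(cell_key):
    """Order by row key, then column key, then newest timestamp first."""
    row, column, timestamp = cell_key
    return (row, column, -timestamp)


def get_versions(row, column):
    """Return every (timestamp, value) pair of one cell, newest first."""
    keys = sorted(
        (k for k in cells if k[0] == row and k[1] == column), key=sort_key
    )
    return [(k[2], cells[k]) for k in keys]


def get_latest(row, column):
    """Return the newest value of a cell, or None if the cell is absent."""
    versions = get_versions(row, column)
    return versions[0][1] if versions else None


# All cell keys in the order the sorted map would hold them.
sorted_keys = sorted(cells, key=sort_key)

print(get_latest("1", "info:age"))
print(get_versions("1", "info:age"))
```

Reading info:age for row 1 returns the value with the larger timestamp, while the older version stays available until it is pruned.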

22

Example of a Student and Subject

Student table: id (PK), name, age, sex

Subject table: id (PK), title, introduction, teacher_id

Student-Subject table: student_id, subject_id, type

Student and Subject are in an m:n relationship.

23


RDBMS

Student table

key   name     age   sex
1     Gaurav   28    Male

Subject table

id   title   introduction    teacher_id
1    HBase   HBase is cool   10

Student-Subject table

student_id   subject_id   type
1            1            elective

24

HBase

Student-Subject schema - HBase

Student table

Row Key      Column family   Column Keys
student_id   info            name, age, sex
student_id   subjects        Subject IDs as qualifier (key)

Subject table

Row Key      Column family   Column Keys
subject_id   info            title, introduction, teacher_id
subject_id   students        Student IDs as qualifier (key)

25

HBase

Student-Subject schema - HBase

Student table

key   info               subjects
1     info:name=Gaurav   subjects:1="elective"
      info:age=28        subjects:2="main"
      info:sex=Male

Subject table

key   info                              students
1     info:title=HBase                  students:1
      info:introduction=HBase is cool   students:2
      info:teacher_id=10
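To show how the denormalized schema replaces the join table, here is a minimal sketch that models the two tables as nested dictionaries (row key, then column family, then qualifier). It is plain Python rather than the HBase client, and `subjects_of`, `students_of`, and `enroll` are hypothetical helpers. The empty string stored under `students:<id>` is an assumption, since the slide shows no value there.

```python
# Plain dictionaries standing in for the two HBase tables (not the client API).
# Layout: row key -> column family -> qualifier -> value.
student_table = {
    "1": {
        "info": {"name": "Gaurav", "age": "28", "sex": "Male"},
        "subjects": {"1": "elective", "2": "main"},
    },
}
subject_table = {
    "1": {
        "info": {
            "title": "HBase",
            "introduction": "HBase is cool",
            "teacher_id": "10",
        },
        # Assumption: only the qualifier matters here, so the value is empty.
        "students": {"1": "", "2": ""},
    },
}


def subjects_of(student_id):
    """One row read replaces the join through the Student-Subject table."""
    return dict(student_table[student_id]["subjects"])


def students_of(subject_id):
    """The reverse direction is answered from the Subject table."""
    return sorted(subject_table[subject_id]["students"])


def enroll(student_id, subject_id, kind):
    """Hypothetical helper: the relationship is written on both sides."""
    student = student_table.setdefault(student_id, {"info": {}, "subjects": {}})
    student["subjects"][subject_id] = kind
    subject = subject_table.setdefault(subject_id, {"info": {}, "students": {}})
    subject["students"][student_id] = ""


print(subjects_of("1"))
print(students_of("1"))
```

The trade-off is that every enrollment is stored twice, so the application is responsible for keeping both tables in step.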

26

Attribute     Possible Values           Default

COMPRESSION   NONE, GZ, LZO             NONE

VERSIONS      1+                        3

TTL           1-2147483647 (seconds)    2147483647

BLOCKSIZE     1 byte – 2 GB             64k

IN_MEMORY     true, false               false

BLOCKCACHE    true, false               true
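As a rough illustration of what VERSIONS and TTL mean for a single cell, the sketch below filters a list of versions using the defaults from the table above. It is a simplified model, not HBase code: the function name `visible_versions` is invented, and the exact handling of a version whose age equals the TTL is an assumption.

```python
def visible_versions(versions, max_versions=3, ttl_seconds=2147483647, now_ms=0):
    """Return the versions of one cell that survive TTL and VERSIONS.

    versions: list of (timestamp_ms, value) pairs in any order.
    Assumption: a version is kept only while it is younger than the TTL.
    """
    cutoff_ms = now_ms - ttl_seconds * 1000
    live = [v for v in versions if v[0] > cutoff_ms]
    # Newest first, then keep at most max_versions of them.
    live.sort(key=lambda v: v[0], reverse=True)
    return live[:max_versions]


history = [(1000, "a"), (2000, "b"), (3000, "c"), (4000, "d"), (5000, "e")]

# Defaults: effectively no expiry, three newest versions kept.
print(visible_versions(history, now_ms=6500))

# A 2-second TTL leaves only versions newer than 4500 ms.
print(visible_versions(history, ttl_seconds=2, now_ms=6500))
```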

27

Region: a contiguous set of lexicographically sorted rows

hbase.hregion.max.filesize (default: 256 MB)

Regions are hosted by Region Servers

Each table is partitioned into Regions

28

Regions and

[Diagram: rows row1, row200, row201, row500, and a new row being added]

29

Regions and

[Diagram: rows row1, row200, row201, row350, row 351, row 501]
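The way a table is cut into regions of contiguous, sorted row keys can be sketched as follows. This is a toy model under stated assumptions: HBase splits a region when its store files exceed hbase.hregion.max.filesize, whereas this sketch splits on a row count (`max_rows_per_region`) to stay small, and the class and method names are invented.

```python
import bisect


class ToyRegionedTable:
    """Sorted row keys partitioned into contiguous regions (illustration only)."""

    def __init__(self, max_rows_per_region):
        # Assumption: a row-count limit stands in for the file-size limit.
        self.max_rows = max_rows_per_region
        self.regions = [[]]  # each region is a sorted list of row keys

    def region_for(self, row_key):
        """Return the index of the region whose key range covers row_key."""
        start_keys = [region[0] for region in self.regions[1:]]
        return bisect.bisect_right(start_keys, row_key)

    def put(self, row_key):
        """Insert a row key and split its region in two if it grows too large."""
        index = self.region_for(row_key)
        region = self.regions[index]
        position = bisect.bisect_left(region, row_key)
        if position < len(region) and region[position] == row_key:
            return  # row already present
        region.insert(position, row_key)
        if len(region) > self.max_rows:
            middle = len(region) // 2
            self.regions[index:index + 1] = [region[:middle], region[middle:]]


table = ToyRegionedTable(max_rows_per_region=4)
for number in (1, 2, 3, 4, 5):
    table.put("row%03d" % number)

print(table.regions)
print(table.region_for("row004"))
```

Row keys are zero-padded here because the ordering is lexicographic: without padding, "row51" would sort after "row500".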

30

Master

ZooKeeper

RegionServers

HDFS

MapReduce

32

HBase Interface – Java API, Thrift...

33

HBase Interface – Java API, Thrift...

Java

Thrift (Ruby, PHP, Python, Perl, C++...)

REST

Groovy DSL

MapReduce

HBase Shell

34

HBase Interface – Java API, Thrift...

Java

Get

Put

Delete

Scan

incrementColumnValue
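To give a feel for what the operations listed above do, here is a small in-memory stand-in written in Python. It is not the HBase Java client and does not reproduce its signatures: the class `ToyTable` and its method names are illustrative only. The scan treats the start row as inclusive and the stop row as exclusive, and the increment returns the new counter value.

```python
class ToyTable:
    """In-memory stand-in for a table, used only to show the operations."""

    def __init__(self):
        self.rows = {}  # row key -> {column key -> value}

    def put(self, row, column, value):
        """Write one cell."""
        self.rows.setdefault(row, {})[column] = value

    def get(self, row):
        """Read all cells of one row (empty dict if the row is absent)."""
        return dict(self.rows.get(row, {}))

    def delete(self, row):
        """Remove a whole row."""
        self.rows.pop(row, None)

    def scan(self, start_row="", stop_row=None):
        """Yield rows in sorted key order from start_row up to, not including, stop_row."""
        for row in sorted(self.rows):
            if row < start_row:
                continue
            if stop_row is not None and row >= stop_row:
                break
            yield row, dict(self.rows[row])

    def increment_column_value(self, row, column, amount=1):
        """Add amount to a counter cell and return the new value."""
        cells = self.rows.setdefault(row, {})
        cells[column] = cells.get(column, 0) + amount
        return cells[column]


table = ToyTable()
table.put("1", "info:name", "Gaurav")
table.put("2", "info:name", "Harsh")
table.put("3", "info:name", "Raman")

print(table.get("1"))
print(list(table.scan("1", "3")))
print(table.increment_column_value("1", "stats:logins"))
```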

36

HBase vs. RDBMS

Not a replacement

Solves only a small subset (~5%)

37

Where SQL makes life easy

Joining

Secondary indexing

Referential integrity (updates)

ACID

Where HBase makes life easy

Dataset scale

Read/write scale

Replication

Batch analysis

40

HBase at Apache (http://hbase.apache.org/)

HBase Wiki (wiki.apache.org/hadoop/Hbase)

HBase blog (blog.hbase.org)

Images from Google Search

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html
